From: Peter Xu <peterx@redhat.com>
To: "Daniel P. Berrangé" <berrange@redhat.com>
Cc: Fabiano Rosas <farosas@suse.de>,
	qemu-devel@nongnu.org, armbru@redhat.com,
	Claudio Fontana <cfontana@suse.de>
Subject: Re: [PATCH v6 00/23] migration: File based migration with multifd and mapped-ram
Date: Tue, 5 Mar 2024 09:51:32 +0800
Message-ID: <ZeZ6pI0O4-3ZQ10A@x1n>
In-Reply-To: <ZeY3c-zFV-i1mrrP@redhat.com>

On Mon, Mar 04, 2024 at 09:04:51PM +0000, Daniel P. Berrangé wrote:
> On Mon, Mar 04, 2024 at 05:15:05PM -0300, Fabiano Rosas wrote:
> > Peter Xu <peterx@redhat.com> writes:
> > 
> > > On Mon, Mar 04, 2024 at 08:53:24PM +0800, Peter Xu wrote:
> > >> On Mon, Mar 04, 2024 at 12:42:25PM +0000, Daniel P. Berrangé wrote:
> > >> > On Mon, Mar 04, 2024 at 08:35:36PM +0800, Peter Xu wrote:
> > >> > > Fabiano,
> > >> > > 
> > >> > > On Thu, Feb 29, 2024 at 12:29:54PM -0300, Fabiano Rosas wrote:
> > >> > > > => guest: 128 GB RAM - 120 GB dirty - 1 vcpu in tight loop dirtying memory
> > >> > > 
> > >> > > I'm curious how much time it normally takes to do the final fdatasync()
> > >> > > for you when you did this test.
> > 
> > I measured and it takes ~4s for the live migration and ~2s for the
> > non-live. I didn't notice this before because the VM goes into
> > postmigrate, so it's paused anyway.

In my case it took tens of seconds at least, if not minutes; I didn't
measure it precisely.

I could have dirtied memory harder, or I just had a slower disk.  IIUC the
worst case is all of the cache being dirty (not yet written back by the
kernel), say 100GB.  Assuming a disk bandwidth of 1GB/s (that's the
bandwidth of my test machine's hard drive with 1M-chunk dd for a 10GB file,
even without a sync..), it could take 1min or more in reality.
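
As a rough back-of-envelope figure (using my assumed numbers above, not a
measurement):

  100 GB of dirty page cache / 1 GB/s of disk bandwidth ~= 100 s,

i.e. well over a minute spent in that final fdatasync().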

> > 
> > >> > > 
> > >> > > I finally got a relatively large system today and gave it a quick shot over
> > >> > > a 128G (100G busy dirty) mapped-ram snapshot with 8 multifd channels.  The
> > >> > > migration save/load all works fine, so I don't think there's anything wrong
> > >> > > with the patchset; however, when the save completes (I'll need to stop the
> > >> > > workload as my disk isn't fast enough, I guess..) I always hit a super
> > >> > > long hang of QEMU on fdatasync() on XFS, during which the main thread is in
> > >> > > UNINTERRUPTIBLE state.
> > >> > 
> > >> > That isn't very surprising. If you don't have O_DIRECT enabled, then
> > >> > all that disk I/O from the migration is going to be in RAM, and thus the
> > >> > fdatasync() is likely to trigger writing out a lot of data.
> > >> > 
> > >> > Blocking the main QEMU thread though is pretty unhelpful. That suggests
> > >> > the data sync needs to be moved to a non-main thread.
> > >> 
> > >> Perhaps the migration thread itself could also be a candidate, then.
> > >> 
> > >> > 
> > >> > With O_DIRECT meanwhile there should be essentially no hit from fdatasync.
> > >> 
> > >> The update of the COMPLETED status could be a good marker point to show
> > >> that such a flush is done, from a user's gut-feeling POV.  If that makes
> > >> sense, maybe we can do that sync before setting COMPLETED.
> > 
> > At migration completion I believe the multifd threads will have already
> > cleaned up and dropped the reference to the channel, so it might be too
> > late by then.
> > 
> > In the multifd threads, we'll be wasting (like we are today) the extra
> > syscalls after the first sync succeeds.
> > 
> > >> 
> > >> No matter which thread does that sync, it's still a pity that it'll go into
> > >> UNINTERRUPTIBLE state during fdatasync(), and then whoever wants to e.g.
> > >> attach gdb to it to have a look will also hang.
> > >
> > > Or... would it be nicer to get rid of the fdatasync() and leave that to the
> > > upper layers?  QEMU has supported file: migration already and never managed
> > > cache behavior; thinking about it, it does smell like something that
> > > shouldn't be done in QEMU, and at least mapped-ram is nothing special in
> > > this regard to me.
> > >
> > > The user should be able to control that either manually (sync), or Libvirt
> > > can do that after QEMU quits; after all, Libvirt holds the fd itself?  That
> > > would allow us to get rid of the above UNINTERRUPTIBLE / un-debuggable
> > > period of QEMU.  Another side benefit: rather than holding all of QEMU's
> > > resources (especially guest RAM) while waiting for a super slow disk flush,
> > > Libvirt / the upper layer can do that separately after releasing all the
> > > QEMU resources first.
> > 
> > I like the idea of QEMU having a self-contained
> > implementation. Especially since we'll add O_DIRECT support, which is
> > already quite heavy-handed if we're talking about managing cache
> > behavior.

O_DIRECT is optionally selected by the user by setting the new parameter
first, so the user is still in full control - it's still the user's decision
how the cache should be managed, even if QEMU needs explicit changes to
support and expose the new parameter.
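
Just to illustrate the mechanics I'm referring to (generic POSIX, nothing
QEMU-specific, and only a sketch): with O_DIRECT the payload writes bypass
the page cache, so a trailing fdatasync() has little left to flush besides
metadata:

  /* Sketch only (generic POSIX, not QEMU code): O_DIRECT writes bypass
   * the page cache, so fdatasync() afterwards mostly flushes metadata. */
  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <stdlib.h>
  #include <string.h>
  #include <unistd.h>

  int write_direct(const char *path)
  {
      int fd = open(path, O_WRONLY | O_CREAT | O_DIRECT, 0600);
      void *buf;

      if (fd < 0) {
          return -1;
      }
      /* O_DIRECT needs properly aligned buffers, offsets and lengths */
      if (posix_memalign(&buf, 4096, 1 << 20)) {
          close(fd);
          return -1;
      }
      memset(buf, 0, 1 << 20);
      pwrite(fd, buf, 1 << 20, 0);   /* goes straight to the device */
      fdatasync(fd);                 /* cheap: no dirty cache to flush */
      free(buf);
      close(fd);
      return 0;
  }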

For fdatasync(), I think it's slightly different in that it doesn't require
anything to be implemented in QEMU, as the snapshot is always in the form of
a file, and a file is a pretty common concept that supports sync semantics
well on its own.  Instead of providing yet another parameter to control it,
we can just avoid that datasync.
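
To make the "upper layers" option concrete, the management layer (or a user
script) could flush the file on an fd it holds (or reopens) after QEMU has
quit, roughly like below.  Just a sketch of the idea, not Libvirt code, and
the helper name is mine:

  /* Sketch only: flush the migration file after QEMU has already quit and
   * released the guest's resources; the slow part no longer holds QEMU. */
  #include <fcntl.h>
  #include <stdio.h>
  #include <unistd.h>

  int flush_saved_image(const char *path)
  {
      int fd = open(path, O_WRONLY);

      if (fd < 0) {
          perror("open");
          return -1;
      }
      if (fdatasync(fd) < 0) {   /* may take a while, but QEMU is gone */
          perror("fdatasync");
      }
      close(fd);
      return 0;
  }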

Besides the reasons I already described above, I think it's also legitimate
if a user wants to temporarily flush a VM to disk (in a paused state), run
some RAM-intensive workloads (which can immediately make use of the guest's
RAM that is freed right away, but may _not_ always require a page cache
flush), then relaunch the VM.  In that case keeping some cache around might
already help speed up the relaunch and avoid unnecessary
swap-ins/swap-outs.

> > 
> > However, it's not trivial to find the right place to add the sync.
> > Wherever we put it there will be some implications, such as ensuring the
> > sync works even after migration failure, avoiding concurrent cleanup,
> > etc.
> > 
> > In any case, I don't think it's correct to have the sync at
> > qio_channel_close(), now that we've seen it might block for a long
> > time. We could at the very least have a qio_channel_flush()[1] which the
> > QIOChannelFile implements with fdatasync(). Then the clients can choose
> > when to sync.
> 
> Yes, I agree with de-coupling it.

Yes, that decoupling makes sense to me.  That definitely clears up some of
my previous confusion.
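
To make sure we mean the same thing, I'd imagine something along these
lines - a rough sketch only, reusing the io_flush hook that QIOChannelSocket
already uses for zero-copy; the function name and error message are mine,
not the actual patch:

  /* Rough sketch for io/channel-file.c: let QIOChannelFile implement the
   * existing io_flush hook with fdatasync(), so callers decide if and when
   * to pay for the sync via qio_channel_flush(). */
  static int qio_channel_file_flush(QIOChannel *ioc, Error **errp)
  {
      QIOChannelFile *fioc = QIO_CHANNEL_FILE(ioc);

      if (qemu_fdatasync(fioc->fd) < 0) {
          error_setg_errno(errp, errno,
                           "Unable to flush file data to storage");
          return -1;
      }
      return 0;
  }

  /* ... plus, in qio_channel_file_class_init():
   *     ioc_klass->io_flush = qio_channel_file_flush;
   */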

The follow-up question is whether we should require a qio_channel_flush()
by default anywhere around the end of migration for mapped-ram; there I
lean towards removing it completely.  In any case, considering the time it
could hang QEMU (possibly minutes), we may want to change that behavior for
9.0 if possible.

Thanks,

-- 
Peter Xu


