From: Fabiano Rosas <farosas@suse.de>
To: "Peter Xu" <peterx@redhat.com>,
	"Daniel P. Berrangé" <berrange@redhat.com>
Cc: qemu-devel@nongnu.org, armbru@redhat.com,
	Claudio Fontana <cfontana@suse.de>
Subject: Re: [PATCH v6 00/23] migration: File based migration with multifd and mapped-ram
Date: Tue, 05 Mar 2024 12:23:09 -0300
Message-ID: <87bk7sitya.fsf@suse.de>
In-Reply-To: <ZeZ6pI0O4-3ZQ10A@x1n>

Peter Xu <peterx@redhat.com> writes:

> On Mon, Mar 04, 2024 at 09:04:51PM +0000, Daniel P. Berrangé wrote:
>> On Mon, Mar 04, 2024 at 05:15:05PM -0300, Fabiano Rosas wrote:
>> > Peter Xu <peterx@redhat.com> writes:
>> > 
>> > > On Mon, Mar 04, 2024 at 08:53:24PM +0800, Peter Xu wrote:
>> > >> On Mon, Mar 04, 2024 at 12:42:25PM +0000, Daniel P. Berrangé wrote:
>> > >> > On Mon, Mar 04, 2024 at 08:35:36PM +0800, Peter Xu wrote:
>> > >> > > Fabiano,
>> > >> > > 
>> > >> > > On Thu, Feb 29, 2024 at 12:29:54PM -0300, Fabiano Rosas wrote:
>> > >> > > > => guest: 128 GB RAM - 120 GB dirty - 1 vcpu in tight loop dirtying memory
>> > >> > > 
>> > >> > > I'm curious how long the final fdatasync() normally takes for you
>> > >> > > when you run this test.
>> > 
>> > I measured and it takes ~4s for the live migration and ~2s for the
>> > non-live. I didn't notice this before because the VM goes into
>> > postmigrate, so it's paused anyway.
>
> In my case it took at least tens of seconds, if not minutes; I didn't
> measure it.
>
> I could have dirtied harder, or I just had a slower disk.  IIUC the worst
> case is all cache dirty (not yet written back by the kernel), say 100GB.
> Assuming a disk bandwidth of 1GB/s (that's the bw of my test machine's hard
> drive with a 1M-chunk dd of a 10GB file, even without a sync..), IIUC it
> means it could take 1min or more in reality.
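
As a rough check of the estimate quoted above, assuming the whole 100GB is
still dirty in the page cache and the disk sustains the 1GB/s measured with
dd (both figures come from the quoted text; real writeback is rarely that
steady):

```latex
t_{\text{flush}} \approx \frac{100\ \text{GB}}{1\ \text{GB/s}} = 100\ \text{s} \approx 1.7\ \text{min}
```
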
>
>> > 
>> > >> > > 
>> > >> > > I finally got a relatively large system today and gave it a quick shot over
>> > >> > > 128G (100G busy dirty) mapped-ram snapshot with 8 multifd channels.  The
>> > >> > > migration save/load all works fine, so I don't think there's anything wrong
>> > >> > > with the patchset, however when save completes (I'll need to stop the
>> > >> > > workload as my disk isn't fast enough I guess..) I'll always hit a super
>> > >> > > long hang of QEMU on fdatasync() on XFS during which the main thread is in
>> > >> > > UNINTERRUPTIBLE state.
>> > >> > 
>> > >> > That isn't very surprising. If you don't have O_DIRECT enabled, then
>> > >> > all that disk I/O from the migrate is going to be in RAM, and thus the
>> > >> > fdatasync() is likely to trigger writing out a lot of data.
>> > >> > 
>> > >> > Blocking the main QEMU thread though is pretty unhelpful. That suggests
>> > >> > the data sync needs to be moved to a non-main thread.
>> > >> 
>> > >> Perhaps migration thread itself can also be a candidate, then.
>> > >> 
>> > >> > 
>> > >> > With O_DIRECT meanwhile there should be essentially no hit from fdatasync.
>> > >> 
>> > >> From a user's POV, the update to COMPLETED status intuitively seems a
>> > >> good marker point to show that such a flush is done.  If that makes
>> > >> sense, maybe we can do that sync before setting COMPLETED.
>> > 
>> > At the migration completion I believe the multifd threads will have
>> > already cleaned up and dropped the reference to the channel, it might be
>> > too late then.
>> > 
>> > In the multifd threads, we'll be wasting (like we are today) the extra
>> > syscalls after the first sync succeeds.
>> > 
>> > >> 
>> > >> No matter which thread does that sync, it's still a pity that it'll go into
>> > >> UNINTERRUPTIBLE during fdatasync(), then whoever wants to e.g. attach a gdb
>> > >> onto it to have a look will also hang.
>> > >
>> > > Or... would it be nicer if we got rid of the fdatasync() and left that to
>> > > upper layers?  QEMU used to support file: migration already, and it never
>> > > managed cache behavior; it does smell like something that shouldn't be
>> > > done in QEMU when thinking about it, at least mapped-ram is nothing
>> > > special to me in this regard.
>> > >
>> > > The user should be able to control that either manually (sync), or Libvirt
>> > > can do that after QEMU quits; after all Libvirt holds the fd itself?  It
>> > > should allow us to get rid of the above UNINTERRUPTIBLE / un-debuggable
>> > > period of QEMU.  Another side benefit: rather than holding all of QEMU's
>> > > resources (especially guest RAM) while waiting for a super slow disk
>> > > flush, Libvirt / upper layer can do that separately after releasing all
>> > > the QEMU resources first.
>> > 
>> > I like the idea of QEMU having a self-contained
>> > implementation. Especially since we'll add O_DIRECT support, which is
>> > already quite heavy-handed if we're talking about managing cache
>> > behavior.
>
> O_DIRECT is optionally selected by the user by setting the new parameter
> first, so the user is still in full control - it's still user's decision on
> how cache should be managed, even if QEMU needs explicit changes to support
> and expose the new parameter.
>
> For fdatasync(), I think it's slightly different in that it doesn't require
> anything implemented in QEMU, as the snapshot is always in the form of a
> file, and a file is a pretty common concept which supports sync semantics
> well, independently of QEMU.  Instead of providing yet another parameter to
> control it, we can just avoid that datasync.
>
> Besides what I already described above as reasons, I think it's also legal
> if a user wants to temporarily flush a VM into a disk (in paused state),
> run some RAM-intense loads (which can immediately make use of guest's RAM
> which is directly freed, but may _not_ always require a page cache flush),
> then relaunch the VM.  In that case keeping some cache around might help
> already to speed up relaunching to avoid unnecessary swap-ins/swap-outs.
>
>> > 
>> > However, it's not trivial to find the right place to add the sync.
>> > Wherever we put it there will be some implications, such as ensuring the
>> > sync works even after migration failure, avoiding concurrent cleanup,
>> > etc.
>> > 
>> > In any case, I don't think it's correct to have the sync at
>> > qio_channel_close(), now that we've seen it might block for a long
>> > time. We could at the very least have a qio_channel_flush()[1] which the
>> > QIOChannelFile implements with fdatasync(). Then the clients can choose
>> > when to sync.
>> 
>> Yes, I agree with de-coupling it.
>
> Yes, that decoupling makes sense to me.  That definitely answers some of my
> previous confusion.
>
> The following question is whether we should require a qio_channel_flush()
> by default anywhere around the end of migration for mapped-ram, in which
> case I lean towards removing it completely.  In all cases, considering the
> time it could hang qemu (possibly for minutes) we may want to change that
> behavior for 9.0 if possible.

Ok, I'll remove it for 9.0 then. And I guess I'll also remove the flush
completely since there are no other users except for migration.
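
For illustration only, the decoupling discussed in the quoted text (an
explicit flush that the caller invokes when it chooses, and a close that never
syncs implicitly) could look roughly like the sketch below. This is plain
POSIX C, not QEMU code: FileChan and the file_chan_* helpers are hypothetical
names made up for this example and are not the QIOChannel API.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical stand-in for a file-backed channel (NOT QIOChannelFile). */
typedef struct FileChan {
    int fd;
} FileChan;

/* Open (create/truncate) a file for writing; returns 0 or -1 with errno set. */
static int file_chan_open(FileChan *c, const char *path)
{
    c->fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    return c->fd < 0 ? -1 : 0;
}

/* Positioned write that retries on EINTR and on short writes. */
static int file_chan_pwrite_all(FileChan *c, const void *buf, size_t len,
                                off_t off)
{
    const char *p = buf;

    while (len > 0) {
        ssize_t n = pwrite(c->fd, p, len, off);
        if (n < 0) {
            if (errno == EINTR) {
                continue;
            }
            return -1;
        }
        p += n;
        len -= (size_t)n;
        off += n;
    }
    return 0;
}

/*
 * Explicit flush. With a lot of dirty page cache this can block for a
 * long time, so the caller decides if and where (which thread) to call it.
 */
static int file_chan_flush(FileChan *c)
{
    return fdatasync(c->fd);
}

/* Close never syncs implicitly; it only releases the descriptor. */
static int file_chan_close(FileChan *c)
{
    int ret = close(c->fd);

    c->fd = -1;
    return ret;
}
```

With this split, a caller that wants durability calls file_chan_flush()
before file_chan_close(); a caller that leaves syncing to an upper layer just
skips the flush.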



Thread overview: 41+ messages
2024-02-29 15:29 [PATCH v6 00/23] migration: File based migration with multifd and mapped-ram Fabiano Rosas
2024-02-29 15:29 ` [PATCH v6 01/23] migration/multifd: Cleanup multifd_recv_sync_main Fabiano Rosas
2024-02-29 15:29 ` [PATCH v6 02/23] io: add and implement QIO_CHANNEL_FEATURE_SEEKABLE for channel file Fabiano Rosas
2024-02-29 15:29 ` [PATCH v6 03/23] io: Add generic pwritev/preadv interface Fabiano Rosas
2024-02-29 15:29 ` [PATCH v6 04/23] io: implement io_pwritev/preadv for QIOChannelFile Fabiano Rosas
2024-02-29 15:29 ` [PATCH v6 05/23] io: fsync before closing a file channel Fabiano Rosas
2024-02-29 15:30 ` [PATCH v6 06/23] migration/qemu-file: add utility methods for working with seekable channels Fabiano Rosas
2024-02-29 15:30 ` [PATCH v6 07/23] migration/ram: Introduce 'mapped-ram' migration capability Fabiano Rosas
2024-02-29 15:30 ` [PATCH v6 08/23] migration: Add mapped-ram URI compatibility check Fabiano Rosas
2024-02-29 15:30 ` [PATCH v6 09/23] migration/ram: Add outgoing 'mapped-ram' migration Fabiano Rosas
2024-02-29 15:30 ` [PATCH v6 10/23] migration/ram: Add incoming " Fabiano Rosas
2024-02-29 15:30 ` [PATCH v6 11/23] tests/qtest/migration: Add tests for mapped-ram file-based migration Fabiano Rosas
2024-02-29 15:30 ` [PATCH v6 12/23] migration/multifd: Rename MultiFDSend|RecvParams::data to compress_data Fabiano Rosas
2024-02-29 15:30 ` [PATCH v6 13/23] migration/multifd: Decouple recv method from pages Fabiano Rosas
2024-02-29 15:30 ` [PATCH v6 14/23] migration/multifd: Allow multifd without packets Fabiano Rosas
2024-02-29 15:30 ` [PATCH v6 15/23] migration/multifd: Allow receiving pages " Fabiano Rosas
2024-02-29 15:30 ` [PATCH v6 16/23] migration/multifd: Add a wrapper for channels_created Fabiano Rosas
2024-02-29 15:30 ` [PATCH v6 17/23] migration/multifd: Add outgoing QIOChannelFile support Fabiano Rosas
2024-03-01  1:43   ` Peter Xu
2024-02-29 15:30 ` [PATCH v6 18/23] migration/multifd: Add incoming " Fabiano Rosas
2024-02-29 15:30 ` [PATCH v6 19/23] migration/multifd: Prepare multifd sync for mapped-ram migration Fabiano Rosas
2024-03-01  1:45   ` Peter Xu
2024-02-29 15:30 ` [PATCH v6 20/23] migration/multifd: Support outgoing mapped-ram stream format Fabiano Rosas
2024-02-29 15:30 ` [PATCH v6 21/23] migration/multifd: Support incoming " Fabiano Rosas
2024-02-29 15:30 ` [PATCH v6 22/23] migration/multifd: Add mapped-ram support to fd: URI Fabiano Rosas
2024-03-01  1:47   ` Peter Xu
2024-02-29 15:30 ` [PATCH v6 23/23] tests/qtest/migration: Add a multifd + mapped-ram migration test Fabiano Rosas
2024-03-01  1:50 ` [PATCH v6 00/23] migration: File based migration with multifd and mapped-ram Peter Xu
2024-03-01  7:18   ` Markus Armbruster
2024-03-01  8:11   ` Daniel P. Berrangé
2024-03-01  8:37 ` Peter Xu
2024-03-04 12:35 ` Peter Xu
2024-03-04 12:42   ` Daniel P. Berrangé
2024-03-04 12:53     ` Peter Xu
2024-03-04 13:12       ` Peter Xu
2024-03-04 20:15         ` Fabiano Rosas
2024-03-04 21:04           ` Daniel P. Berrangé
2024-03-05  1:51             ` Peter Xu
2024-03-05 15:23               ` Fabiano Rosas [this message]
2024-03-04 13:09     ` Fabiano Rosas
2024-03-04 13:17       ` Peter Xu
