From: Fabiano Rosas <farosas@suse.de>
To: Peter Xu <peterx@redhat.com>
Cc: qemu-devel@nongnu.org, berrange@redhat.com, armbru@redhat.com,
Claudio Fontana <cfontana@suse.de>, Jim Fehlig <jfehlig@suse.com>,
Thomas Huth <thuth@redhat.com>,
Laurent Vivier <lvivier@redhat.com>,
Paolo Bonzini <pbonzini@redhat.com>
Subject: Re: [PATCH v2 18/18] migration/ram: Add direct-io support to precopy file migration
Date: Mon, 10 Jun 2024 14:45:53 -0300
Message-ID: <87r0d4wv1q.fsf@suse.de>
In-Reply-To: <ZmclVQw0x7KKLxmF@x1n>

Peter Xu <peterx@redhat.com> writes:

> On Fri, Jun 07, 2024 at 03:42:35PM -0300, Fabiano Rosas wrote:
>> Peter Xu <peterx@redhat.com> writes:
>>
>> > On Thu, May 23, 2024 at 04:05:48PM -0300, Fabiano Rosas wrote:
>> >> We've recently added support for direct-io with multifd, which brings
>> >> performance benefits, but creates a non-uniform user interface by
>> >> coupling direct-io with the multifd capability. This means that users
>> >> cannot keep the direct-io flag enabled while disabling multifd.
>> >>
>> >> Libvirt in particular already has support for direct-io and parallel
>> >> migration separately from each other, so it would be a regression to
>> >> now require both options together. It's relatively simple for QEMU to
>> >> add support for direct-io migration without multifd, so let's do this
>> >> in order to keep both options decoupled.
>> >>
>> >> We cannot simply enable the O_DIRECT flag, however, because not all IO
>> >> performed by the migration thread satisfies the alignment requirements
>> >> of O_DIRECT. There are many small read & writes that add headers and
>> >> synchronization flags to the stream, which at the moment are required
>> >> to always be present.
>> >>
>> >> Fortunately, due to fixed-ram migration there is a discernible moment
>> >> where only RAM pages are written to the migration file. Enable
>> >> direct-io during that moment.
>> >>
>> >> Signed-off-by: Fabiano Rosas <farosas@suse.de>
>> >
>> > Is anyone going to consume this? How's the performance?
>>
>> I don't think we have a pre-determined consumer for this. This came up
>> in an internal discussion about making the interface simpler for libvirt
>> and in a thread on the libvirt mailing list[1] about using O_DIRECT to
>> keep the snapshot data out of the caches to avoid impacting the rest of
>> the system. (I could have described this better in the commit message,
>> sorry).
>>
>> Quoting Daniel:
>>
>> "Note the reason for using O_DIRECT is *not* to make saving / restoring
>> the guest VM faster. Rather it is to ensure that saving/restoring a VM
>> does not trash the host I/O / buffer cache, which will negatively impact
>> performance of all the *other* concurrently running VMs."
>>
>> 1- https://lore.kernel.org/r/87sez86ztq.fsf@suse.de
>>
>> About performance, a quick test on a stopped 30G guest shows that
>> mapped-ram=on direct-io=on is 12% slower than mapped-ram=on
>> direct-io=off.
>
> Yes, this makes sense.
>
>>
>> >
>> > It doesn't look super fast to me if we need to enable/disable dio in each
>> > loop.. then it's a matter of whether we should bother, or would it be
>> > easier that we simply require multifd when direct-io=on.
>>
>> AIUI, the issue here is that users are already allowed to specify in
>> libvirt the equivalent to direct-io and multifd independent of each
>> other (bypass-cache, parallel). To start requiring both together now in
>> some situations would be a regression. I confess I don't know the libvirt
>> code well enough to know whether this can be worked around somehow, but
>> as I said, it's a relatively simple change from the QEMU side.
>
> Firstly, I definitely want to already avoid all the calls to either
> migration_direct_io_start() or *_finish(), now we already need to
> explicitly call them in three paths, and that's not intuitive and less
> readable, just like the hard coded rdma codes.

Right, but that's just a side-effect of how the code is structured and
the fact that writes to the stream happen in small chunks. Setting
O_DIRECT needs to happen around aligned IO. We could move the calls
further down into qemu_put_buffer_at(), but that would be four fcntl()
calls for every page.
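
As a rough illustration of that cost (this is not the code from the
series; set_direct_io() is a hypothetical helper, and only fcntl(2) with
F_GETFL/F_SETFL is real API), toggling O_DIRECT on an already-open fd
looks something like the sketch below. Each toggle is a F_GETFL followed
by a F_SETFL, which is presumably where a count of four calls per
enable/disable pair comes from:

```c
#define _GNU_SOURCE             /* exposes O_DIRECT on Linux */
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>

/*
 * Hypothetical helper, for illustration only: set or clear O_DIRECT on
 * an open fd. Returns 0 on success, -1 on error (errno set by fcntl()).
 * Enabling can fail with EINVAL if the filesystem lacks direct IO support.
 */
int set_direct_io(int fd, bool enable)
{
    int flags;

    assert(fd >= 0);

    flags = fcntl(fd, F_GETFL);             /* call 1: read current flags */
    if (flags == -1) {
        return -1;
    }

    if (enable) {
        flags |= O_DIRECT;
    } else {
        flags &= ~O_DIRECT;
    }

    /* call 2: write the updated flags back */
    return fcntl(fd, F_SETFL, flags) == -1 ? -1 : 0;
}
```

Wrapping a single page write in set_direct_io(fd, true) and
set_direct_io(fd, false) would repeat this for every page, which is why
the calls sit around the larger RAM-writing phase instead.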

A tangent: one thing that occurred to me just now is that we may be able
to restrict calls to qemu_fflush() to internal code like add_to_iovec(),
and maybe use that function to gather the correct amount of data before
writing, making sure it disables O_DIRECT in case alignment is about to
be broken?
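
To make "alignment is about to be broken" a bit more concrete, here is a
minimal sketch of the kind of check such code could make before issuing
a write. The function name and the idea of passing the alignment in as a
parameter are assumptions of this sketch, not something the series does;
the actual O_DIRECT alignment requirement depends on the filesystem and
the device:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical predicate, for illustration only: true if the buffer
 * address, the length and the file offset are all multiples of align.
 * align must be a non-zero power of two (an assumption of this sketch).
 */
bool dio_write_is_aligned(const void *buf, size_t len, uint64_t offset,
                          size_t align)
{
    uintptr_t mask;

    assert(align != 0 && (align & (align - 1)) == 0);
    mask = (uintptr_t)align - 1;

    return ((uintptr_t)buf & mask) == 0 &&
           (len & mask) == 0 &&
           (offset & mask) == 0;
}
```

A caller could keep O_DIRECT set while this returns true and fall back
to buffered IO (clearing the flag first) as soon as it returns false.
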
>
> I also worry we may overlook the complexity here, and pinning buffers
> definitely need more thoughts on its own. It's easier to digest when using
> multifd and when QEMU only pins guest pages just like tcp-zerocopy does,
> which are naturally host page size aligned, and also guaranteed to not be
> freed (while reused / modified is fine here, as dirty tracking guarantees a
> new page will be migrated soon again).

I don't get this at all, sorry. What is different from multifd here?
We're writing on the same HVA as the one that would be given to multifd
(if it were enabled) and dirty tracking is working the same.

> IMHO here the "not be freed / modified" is even more important than
> "alignment": the latter is about perf, the former is about correctness.
> When we do directio on random buffers, AFAIU we don't want to have the
> buffer modified before flushed to disk, and that's IMHO not easy to
> guarantee.
>
> E.g., I don't think this guarantees a flush on the buffer usages:
>
> migration_direct_io_start()
> /* flush any potentially unaligned IO before setting O_DIRECT */
> qemu_fflush(file);
>
> qemu_fflush() internally does writev(), and that "flush" is about "flushing
> qemufile iov[] to fd", not "flushing buffers to disk". I think it means
> if we do qemu_fflush() then we modify QEMUFile.buf[IO_BUF_SIZE] we're
> doomed: we will never know whether dio has happened, and which version of
> buffer will be sent; I don't think it's guaranteed it will always be the
> old version of the buffer.
>
> However the issue is, QEMUFile defines qemu_fflush() as: after call, the
> buf[] can be reused! It suggests breaking things I guess in dio context.

I think you're mixing the usage of qemu_put_byte()/qemu_put_buffer()
with the usage of qemu_put_buffer_at(). The former two use
QEMUFile.buf without O_DIRECT, and the latter writes directly to the fd
at the page offset. So there's no issue in reusing buf before writes
have reached the disk. All writes going through buf are serialized and
all writes going through qio_channel_pwrite() go to a different offset.

I included all of these assert(!f->dio) to ensure that we don't use the
two APIs incorrectly, mainly that we don't try to write to buf while
O_DIRECT is set.
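
A stripped-down sketch of that separation, in case it helps. This is not
QEMU code: SketchFile and the sketch_*() functions are hypothetical
stand-ins for QEMUFile, qemu_put_buffer() and qemu_put_buffer_at(), and
the dio field here is plain bookkeeping (the sketch never actually sets
O_DIRECT on the fd):

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define SKETCH_BUF_SIZE 4096

/* Hypothetical stand-in for QEMUFile; not the real structure. */
typedef struct {
    int fd;
    bool dio;                           /* true while O_DIRECT would be set */
    size_t buf_len;
    unsigned char buf[SKETCH_BUF_SIZE]; /* staging area for stream data */
} SketchFile;

/* Open (or create) the backing file and reset the staging buffer. */
int sketch_open(SketchFile *f, const char *path)
{
    memset(f, 0, sizeof(*f));
    f->fd = open(path, O_CREAT | O_RDWR | O_TRUNC, 0600);
    return f->fd == -1 ? -1 : 0;
}

/*
 * Stream path: small, unaligned data (headers, flags) is staged in buf.
 * It must never run while direct IO is enabled, hence the assert.
 */
void sketch_put_buffer(SketchFile *f, const void *data, size_t len)
{
    assert(!f->dio);
    assert(len <= SKETCH_BUF_SIZE - f->buf_len);
    memcpy(f->buf + f->buf_len, data, len);
    f->buf_len += len;
}

/*
 * Positioned path: bypasses buf entirely and writes the caller's memory
 * straight to the fd at a fixed file offset.
 */
ssize_t sketch_put_buffer_at(SketchFile *f, const void *page, size_t len,
                             off_t offset)
{
    return pwrite(f->fd, page, len, offset);
}
```

The positioned path never reads or writes buf, so reusing buf after a
flush cannot change what the positioned writes put on disk.
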
>
> IIUC currently mapped-ram is ok because mapped-ram is just special that it
> doesn't have page headers, so it doesn't use the buf[] during iterations;
> while for zeropage it uses file_bmap bitmap and that's separate too and
> does not generate any byte on the wire either.

Right. This is all mapped-ram. I'm not proposing to enable O_DIRECT for
migration in general.

>
> xbzrle could use that buf[], but maybe mapped-ram doesn't work anyway with
> xbzrle.
>
> Everything is just very not obvious and tricky to me. This still looks
> pretty dangerous to me. Would migration_direct_io_finish() guarantee
> something like a fdatasync()? If so it looks safer, but still within the
> start() and finish() if someone calls qemu_fflush() and reuse the buffer we
> can still get hard to debug issues (as the outcome would be that we saw
> corrupted migration files).
>
>>
>> Another option would be for libvirt to keep using multifd, but
>> make it 1 channel only if --parallel is not specified. That might be
>> enough to solve the interface issues. Of course, it's a different code
>> altogether than the usual precopy code that gets executed when
>> multifd=off, I don't know whether that could be an issue somehow.
>
> Would there be any comment from Libvirt side? This sounds like a good
> solution if my above concern is real; as long as we always stick dio with
> guest pages we'll be all fine.
>
> Thanks,