From: Fabiano Rosas <farosas@suse.de>
To: Peter Xu <peterx@redhat.com>
Cc: qemu-devel@nongnu.org, berrange@redhat.com, armbru@redhat.com,
	Juan Quintela <quintela@redhat.com>,
	Leonardo Bras <leobras@redhat.com>,
	Claudio Fontana <cfontana@suse.de>
Subject: Re: [RFC PATCH v3 15/30] io: Add a pwritev/preadv version that takes a discontiguous iovec
Date: Wed, 17 Jan 2024 15:06:15 -0300
Message-ID: <87fryvdeco.fsf@suse.de>
In-Reply-To: <ZaeiXra5hLSo0jnt@x1n>

Peter Xu <peterx@redhat.com> writes:

> On Tue, Jan 16, 2024 at 03:15:50PM -0300, Fabiano Rosas wrote:
>> Peter Xu <peterx@redhat.com> writes:
>> 
>> > On Mon, Nov 27, 2023 at 05:25:57PM -0300, Fabiano Rosas wrote:
>> >> For the upcoming support to fixed-ram migration with multifd, we need
>> >> to be able to accept an iovec array with non-contiguous data.
>> >> 
>> >> Add a pwritev and preadv version that splits the array into contiguous
>> >> segments before writing. With that we can have the ram code continue
>> >> to add pages in any order and the multifd code continue to send large
>> >> arrays for reading and writing.
>> >> 
>> >> Signed-off-by: Fabiano Rosas <farosas@suse.de>
>> >> ---
>> >> - split the API that was merged into a single function
>> >> - use uintptr_t for compatibility with 32-bit
>> >> ---
>> >>  include/io/channel.h | 26 ++++++++++++++++
>> >>  io/channel.c         | 70 ++++++++++++++++++++++++++++++++++++++++++++
>> >>  2 files changed, 96 insertions(+)
>> >> 
>> >> diff --git a/include/io/channel.h b/include/io/channel.h
>> >> index 7986c49c71..25383db5aa 100644
>> >> --- a/include/io/channel.h
>> >> +++ b/include/io/channel.h
>> >> @@ -559,6 +559,19 @@ int qio_channel_close(QIOChannel *ioc,
>> >>  ssize_t qio_channel_pwritev(QIOChannel *ioc, const struct iovec *iov,
>> >>                              size_t niov, off_t offset, Error **errp);
>> >>  
>> >> +/**
>> >> + * qio_channel_pwritev_all:
>> >> + * @ioc: the channel object
>> >> + * @iov: the array of memory regions to write data from
>> >> + * @niov: the length of the @iov array
>> >> + * @offset: the iovec offset in the file where to write the data
>> >> + * @errp: pointer to a NULL-initialized error object
>> >> + *
>> >> + * Returns: 0 if all bytes were written, or -1 on error
>> >> + */
>> >> +int qio_channel_pwritev_all(QIOChannel *ioc, const struct iovec *iov,
>> >> +                            size_t niov, off_t offset, Error **errp);
>> >> +
>> >>  /**
>> >>   * qio_channel_pwrite
>> >>   * @ioc: the channel object
>> >> @@ -595,6 +608,19 @@ ssize_t qio_channel_pwrite(QIOChannel *ioc, char *buf, size_t buflen,
>> >>  ssize_t qio_channel_preadv(QIOChannel *ioc, const struct iovec *iov,
>> >>                             size_t niov, off_t offset, Error **errp);
>> >>  
>> >> +/**
>> >> + * qio_channel_preadv_all:
>> >> + * @ioc: the channel object
>> >> + * @iov: the array of memory regions to read data to
>> >> + * @niov: the length of the @iov array
>> >> + * @offset: the iovec offset in the file from where to read the data
>> >> + * @errp: pointer to a NULL-initialized error object
>> >> + *
>> >> + * Returns: 0 if all bytes were read, or -1 on error
>> >> + */
>> >> +int qio_channel_preadv_all(QIOChannel *ioc, const struct iovec *iov,
>> >> +                           size_t niov, off_t offset, Error **errp);
>> >> +
>> >>  /**
>> >>   * qio_channel_pread
>> >>   * @ioc: the channel object
>> >> diff --git a/io/channel.c b/io/channel.c
>> >> index a1f12f8e90..2f1745d052 100644
>> >> --- a/io/channel.c
>> >> +++ b/io/channel.c
>> >> @@ -472,6 +472,69 @@ ssize_t qio_channel_pwritev(QIOChannel *ioc, const struct iovec *iov,
>> >>      return klass->io_pwritev(ioc, iov, niov, offset, errp);
>> >>  }
>> >>  
>> >> +static int qio_channel_preadv_pwritev_contiguous(QIOChannel *ioc,
>> >> +                                                 const struct iovec *iov,
>> >> +                                                 size_t niov, off_t offset,
>> >> +                                                 bool is_write, Error **errp)
>> >> +{
>> >> +    ssize_t ret = -1;
>> >> +    int i, slice_idx, slice_num;
>> >> +    uintptr_t base, next, file_offset;
>> >> +    size_t len;
>> >> +
>> >> +    slice_idx = 0;
>> >> +    slice_num = 1;
>> >> +
>> >> +    /*
>> >> +     * If the iov array doesn't have contiguous elements, we need to
>> >> +     * split it in slices because we only have one (file) 'offset' for
>> >> +     * the whole iov. Do this here so callers don't need to break the
>> >> +     * iov array themselves.
>> >> +     */
>> >> +    for (i = 0; i < niov; i++, slice_num++) {
>> >> +        base = (uintptr_t) iov[i].iov_base;
>> >> +
>> >> +        if (i != niov - 1) {
>> >> +            len = iov[i].iov_len;
>> >> +            next = (uintptr_t) iov[i + 1].iov_base;
>> >> +
>> >> +            if (base + len == next) {
>> >> +                continue;
>> >> +            }
>> >> +        }
>> >> +
>> >> +        /*
>> >> +         * Use the offset of the first element of the segment that
>> >> +         * we're sending.
>> >> +         */
>> >> +        file_offset = offset + (uintptr_t) iov[slice_idx].iov_base;
>> >> +
>> >> +        if (is_write) {
>> >> +            ret = qio_channel_pwritev(ioc, &iov[slice_idx], slice_num,
>> >> +                                      file_offset, errp);
>> >> +        } else {
>> >> +            ret = qio_channel_preadv(ioc, &iov[slice_idx], slice_num,
>> >> +                                     file_offset, errp);
>> >> +        }
>> >> +
>> >> +        if (ret < 0) {
>> >> +            break;
>> >> +        }
>> >> +
>> >> +        slice_idx += slice_num;
>> >> +        slice_num = 0;
>> >> +    }
>> >> +
>> >> +    return (ret < 0) ? -1 : 0;
>> >> +}
>> >> +
>> >> +int qio_channel_pwritev_all(QIOChannel *ioc, const struct iovec *iov,
>> >> +                            size_t niov, off_t offset, Error **errp)
>> >> +{
>> >> +    return qio_channel_preadv_pwritev_contiguous(ioc, iov, niov,
>> >> +                                                 offset, true, errp);
>> >> +}
>> >
>> > I'm not sure how Dan thinks about this, but I don't think this is pretty..
>> >
>> > With this implementation, iochannels' preadv/pwritev is completely not
>> > compatible with most OSes now, afaiu.
>> 
>> This is internal QEMU code. I hope no one is expecting qio_channel_foo()
>> to behave like some OS's foo() system call. We cannot guarantee that
>> compatibility save for the simplest of wrappers.
>
> I was expecting that when I started to read. :)
>
> https://man.freebsd.org/cgi/man.cgi?query=pwritev
> https://linux.die.net/man/2/pwritev
>
> It's not "some OSes", it's mostly all.

What I mean is that no one would ever replace a call to pwritev() with
qio_channel_pwritev() and expect the same behavior. We're not writing a
libc.

> I can understand you prefer such
> approach, but even if so, shall we still try to avoid using pwritev/preadv
> as the names?
>

Yes, it's probably better to avoid those if we're going to be doing any
extra operations.

>> 
>> >
>> > The definition of offset in preadv/pwritev of current iochannel is hard to
>> > understand.. if I read it right it'll later be set to:
>> >       
>> >                 /*
>> >                  * If we subtract the host page now, we don't need to
>> >                  * pass it into qio_channel_pwritev_all() below.
>> >                  */
>> >                 write_base = p->pages->block->pages_offset -
>> >                     (uintptr_t)p->pages->block->host;
>> >
>> > Which I cannot easily tell what it is.. besides being an unsigned int.
>> 
>> This description was unfortunately dropped along the way:
>> 
>> "Since iovs can be non contiguous, we'd need a separate array on the
>> side to carry an extra file offset for each of them, so I'm relying on
>> the fact that iovs are all within a same host page and passing in an
>> encoded offset that takes the host page into account."
>> 
>> > IIUC it's also based on the assumption that the host address of each iov
>> > entry is linear to its offset in the file, but it may not be true for
>> > future iochannel users of such interface called as pwritev/preadv.  So
>> > error prone.
>> 
>> Yes, but it's also our choice whether to make this a generic API. We may
>> have good reasons to consider a migration-specific function here.
>> 
>> > Would it be possible we keep using the offset array (p->pages->offset[x])?
>> > We have it already anyway, right?  Wouldn't that be clearer?
>> >
>> 
>> We'd have to make a copy of the array because p->pages is expected to
>> change while the IO happens.
>
> Hmm, I don't see why p->pages can change. IIUC p->pages will be there solid
> at least until all IO syscalls are completed, then the next call to, e.g.,
> multifd_send_pages() will swap that with multifd_send_state->pages.  But I
> think I get your point, with below.

Oh no, you're right. Because of p->pending_job. And thinking about
p->pending_job, wouldn't a trylock do the same job while being more
explicit?

    next_channel %= migrate_multifd_channels();
    for (i = next_channel;; i = (i + 1) % migrate_multifd_channels()) {
        p = &multifd_send_state->params[i];

        /* qemu_mutex_trylock() returns 0 when the lock was taken */
        if (qemu_mutex_trylock(&p->mutex) == 0) {
            if (p->quit) {
                error_report("%s: channel %d has already quit!", __func__, i);
                qemu_mutex_unlock(&p->mutex);
                return -1;
            }
            next_channel = (i + 1) % migrate_multifd_channels();
            break;
        } else {
            /* channel still busy, try the next one */
        }
    }
    multifd_send_state->pages = p->pages;
    p->pages = pages;
    qemu_mutex_unlock(&p->mutex);

>> And while we already have a copy in
>> p->normal, my intention for multifd was to eliminate p->normal in the
>> future, so it would be nice if we could avoid it.
>> 
>> Also, we cannot use p->pages->offset alone because we still need the
>> pages_offset, i.e. the file offset where that ramblock's pages begin.
>> So that means also adding that to each element of the new array.
>> 
>> It would probably be overall clearer and less wasteful to pass in the
>> host page address instead of an array of offsets. I don't see an issue
>> with restricting the iovs to the same host page. The migration code is
>> the only user for this code and AFAIK we don't have plans to change that
>> invariant.
>
> So I think I get your point now, the only concern (besides naming..) is,
> I still want to avoid an interface that contains a field that is hard to
> understand like write_base.
>
> How about this?
>
>   /**
>    * multifd_write_ramblock_iov: Write IO vector (of ramblock) to channel
>    *
>    * @ioc: The iochannel to write to. The IOC must implement the
>    *       pwritev/preadv interface.
>    * @iov: The IO vector to write.  All addresses must be within the
>    *       ramblock host address range.
>    * @iov_len: The IO vector size
>    * @ramblock: The ramblock that covers all buffers in this IO vector
>    */
>   int multifd_write_ramblock_iov(ioc, iov, iov_len, ramblock);

Ok, then I can take block->pages_offset and block->host from the
ramblock. I think I prefer something like this, that way we can be
explicit about the migration assumptions.
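
As a rough illustration of that interface (a sketch under assumptions,
not the code from this series): given the ramblock, the file offset of
each buffer is pages_offset plus the buffer's distance from host, and
runs of buffers that are adjacent in memory can go out in a single
pwritev(). The names ramblock_view, iov_slice, ramblock_next_slice and
ramblock_write_iov are hypothetical, and a plain file descriptor stands
in for the QIOChannel.

```c
#define _DEFAULT_SOURCE /* exposes pwritev() on glibc */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Hypothetical stand-in for the RAMBlock fields discussed above. */
struct ramblock_view {
    uint8_t *host;         /* start of the block in host memory */
    size_t   length;       /* size of the block in bytes */
    off_t    pages_offset; /* file offset where the block's pages begin */
};

/* One run of iov entries that are adjacent in host memory. */
struct iov_slice {
    size_t first;       /* index of the first entry in the run */
    size_t count;       /* number of entries in the run */
    size_t bytes;       /* total length of the run */
    off_t  file_offset; /* where the run lands in the file */
};

/* Find the contiguous run that starts at iov[first]. */
static struct iov_slice ramblock_next_slice(const struct iovec *iov,
                                            size_t niov, size_t first,
                                            const struct ramblock_view *rb)
{
    assert(first < niov);

    const uint8_t *base = iov[first].iov_base;
    const uint8_t *end = base + iov[first].iov_len;
    struct iov_slice s = { .first = first, .count = 1,
                           .bytes = iov[first].iov_len };

    /* Extend the run while the next entry starts where this one ends. */
    while (first + s.count < niov &&
           (const uint8_t *)iov[first + s.count].iov_base == end) {
        end += iov[first + s.count].iov_len;
        s.bytes += iov[first + s.count].iov_len;
        s.count++;
    }

    /* The migration assumption: every buffer lies inside the ramblock. */
    assert(base >= rb->host && end <= rb->host + rb->length);

    s.file_offset = rb->pages_offset + (off_t)(base - rb->host);
    return s;
}

/*
 * Write the whole iov, one pwritev() per contiguous run.
 * Returns 0 on success, -1 on error or short write. A real caller
 * would also have to keep each run below IOV_MAX entries.
 */
static int ramblock_write_iov(int fd, const struct iovec *iov, size_t niov,
                              const struct ramblock_view *rb)
{
    size_t i = 0;

    while (i < niov) {
        struct iov_slice s = ramblock_next_slice(iov, niov, i, rb);
        ssize_t ret = pwritev(fd, &iov[s.first], (int)s.count,
                              s.file_offset);

        if (ret < 0 || (size_t)ret != s.bytes) {
            return -1;
        }
        i += s.count;
    }
    return 0;
}
```

With this shape the caller never sees an encoded offset like
write_base; the host-address-to-file-offset mapping stays in one place
next to the assertion that documents it.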

Thanks!

