From: Hanna Czenczek <hreitz@redhat.com>
To: Eric Blake <eblake@redhat.com>
Cc: qemu-block@nongnu.org, qemu-devel@nongnu.org,
Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>,
Kevin Wolf <kwolf@redhat.com>,
Stefan Hajnoczi <stefanha@redhat.com>, Fam Zheng <fam@euphon.net>
Subject: Re: [RFC 1/2] block: Split padded I/O vectors exceeding IOV_MAX
Date: Thu, 16 Mar 2023 10:43:38 +0100
Message-ID: <007b6adc-0c79-3d5c-92d4-108e14a36424@redhat.com>
In-Reply-To: <20230315182501.w5zed6yktlfeytlf@redhat.com>

On 15.03.23 19:25, Eric Blake wrote:
> On Wed, Mar 15, 2023 at 01:13:29PM +0100, Hanna Czenczek wrote:
>> When processing vectored guest requests that are not aligned to the
>> storage request alignment, we pad them by adding head and/or tail
>> buffers for a read-modify-write cycle.
>>
>> The guest can submit I/O vectors up to IOV_MAX (1024) in length, but
>> with this padding, the vector can exceed that limit. As of
>> 4c002cef0e9abe7135d7916c51abce47f7fc1ee2 ("util/iov: make
>> qemu_iovec_init_extended() honest"), we refuse to pad vectors beyond the
>> limit, instead returning an error to the guest.
>>
>> To the guest, this appears as a random I/O error. We should not return
>> an I/O error to the guest when it issued a perfectly valid request.
>>
>> Before 4c002cef0e9abe7135d7916c51abce47f7fc1ee2, we just made the vector
>> longer than IOV_MAX, which generally happened to work: because the guest
>> assumes a smaller alignment than we really have, file-posix's
>> raw_co_prw() will generally see bdrv_qiov_is_aligned() return false and
>> so emulate the request, so that the IOV_MAX limit does not come into
>> play. Still, that does not seem like a great thing to rely on.
>>
>> I see two ways to fix this problem:
>> 1. We split such long requests into two requests.
>> 2. We join some elements of the vector into new buffers to make it
>> shorter.
>>
>> I am wary of (1), because it seems like it may have unintended side
>> effects.
>>
>> (2) on the other hand seems relatively simple to implement, with
>> hopefully few side effects, so this patch does that.
> Agreed that approach 2 is more conservative.
>
>> Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=2141964
>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
>> ---
>> block/io.c | 139 ++++++++++++++++++++++++++++++++++++++++++++++++++---
>> util/iov.c | 4 --
>> 2 files changed, 133 insertions(+), 10 deletions(-)
>>
>> +/*
>> + * If padding has made the IOV (`pad->local_qiov`) too long (more than IOV_MAX
>> + * elements), collapse some elements into a single one so that it adheres to the
>> + * IOV_MAX limit again.
>> + *
>> + * If collapsing, `pad->collapse_buf` will be used as a bounce buffer of length
>> + * `pad->collapse_len`. `pad->collapsed_qiov` will contain the previous entries
>> + * (before collapsing), so that bdrv_padding_destroy() can copy the bounce
>> + * buffer content back for read requests.
>> + *
>> + * Note that we will not touch the padding head or tail entries here. We cannot
>> + * move them to a bounce buffer, because for RMWs, both head and tail expect to
>> + * be in an aligned buffer with scratch space after (head) or before (tail) to
>> + * perform the read into (because the whole buffer must be aligned, but head's
>> + * and tail's lengths naturally cannot be aligned, because they provide padding
>> + * for unaligned requests). A collapsed bounce buffer for multiple IOV elements
>> + * cannot provide such scratch space.
>> + *
>> + * Therefore, this function collapses the first IOV elements after the
>> + * (potential) head element.
> It looks like you blindly pick the first one or two non-padding iovs
> at the front of the array. Would it be any wiser (in terms of less
> memmove() action or even a smaller bounce buffer) to pick iovs at the
> end of the array, and/or a sequential search for the smallest
> neighboring iovs? Or is that a micro-optimization that costs more
> than it saves?
Right, I didn’t think of picking entries near the end – that indeed makes
sense!  Even if it doesn’t help performance, it at least allows dropping
the non-trivial comment explaining the memmove().
As for searching for the smallest neighboring buffers, I’m not sure.  It
could pay off performance-wise – iterating over up to 1024 elements will
probably be amortized quickly when the buffer sizes really do differ.  My
main concern is that it would make the code more complicated, and I just
don’t think that’s worth it for such a rare case.
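Just to make the “near the end” variant concrete, I imagine it would look
roughly like this (a rough sketch only – field names taken from the comment
above, `bs` assumed to be in scope, error handling omitted, direct fiddling
with the QEMUIOVector fields for brevity; not the actual patch code):

  /* Collapse the two guest entries just before the (potential) tail
   * padding entry, so none of the earlier entries have to be moved. */
  int niov = pad->local_qiov.niov;
  int tail_ents = pad->tail ? 1 : 0;
  struct iovec *a = &pad->local_qiov.iov[niov - tail_ents - 2];
  struct iovec *b = &pad->local_qiov.iov[niov - tail_ents - 1];

  pad->collapse_len = a->iov_len + b->iov_len;
  pad->collapse_buf = qemu_blockalign(bs, pad->collapse_len);

  /* Remember the original entries so read data can be copied back later */
  qemu_iovec_init(&pad->collapsed_qiov, 2);
  qemu_iovec_add(&pad->collapsed_qiov, a->iov_base, a->iov_len);
  qemu_iovec_add(&pad->collapsed_qiov, b->iov_base, b->iov_len);

  /* Copy the guest data into the bounce buffer (needed for writes;
   * harmless for reads, which overwrite it anyway) */
  qemu_iovec_to_buf(&pad->collapsed_qiov, 0,
                    pad->collapse_buf, pad->collapse_len);

  /* Replace the two entries with the single bounce-buffer entry; only
   * the tail entry (if there is one) needs to move up a slot */
  a->iov_base = pad->collapse_buf;
  a->iov_len = pad->collapse_len;
  if (tail_ents) {
      *b = pad->local_qiov.iov[niov - 1];
  }
  pad->local_qiov.niov = niov - 1;

I.e. apart from the index math the logic stays the same, we just avoid
shuffling ~1024 entries around.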
> Would it be any easier to swap the order of padding vs. collapsing?
> That is, we already know the user is giving us a long list of iovs; if
> it is 1024 elements long, and we can detect that padding will be
> needed, should we collapse before padding instead of padding, finding
> that we now have 1026, and memmove'ing back into 1024?
I’d prefer that, but it’s difficult. We need the temporary QIOV
(pad->local_qiov) so we can merge entries, but this is only created by
qemu_iovec_init_extended().
We can try to move the collapsing into qemu_iovec_init_extended(), but
it would be a bit awkward still, and probably blow up that function’s
interface (it’s in util/iov.c, so we can’t really immediately use the
BdrvRequestPadding object). I think, in the end, functionally not much
would change, so I’d rather keep the order as it is (unless someone has
a good idea here).
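To illustrate the constraint: the flow in bdrv_pad_request() is currently
roughly the following (reconstructed from memory and simplified, and the
collapse helper name is made up):

  /* Build the padded vector first: head buffer + guest entries + tail
   * buffer.  pad->local_qiov only exists after this call. */
  ret = qemu_iovec_init_extended(&pad->local_qiov,
                                 pad->buf, pad->head,
                                 *qiov, *qiov_offset, *bytes,
                                 pad->buf + pad->buf_len - pad->tail,
                                 pad->tail);
  if (ret < 0) {
      return ret;
  }

  /* Only now can we see whether padding pushed us past the limit and
   * merge entries within pad->local_qiov (hypothetical helper name): */
  if (pad->local_qiov.niov > IOV_MAX) {
      bdrv_padding_collapse(pad, bs);
  }

Collapsing before that first call would mean teaching
qemu_iovec_init_extended() (or a new variant of it) about the limit, which
is the interface blow-up I mean.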
> But logic-wise, your patch looks correct to me.
>
> Reviewed-by: Eric Blake <eblake@redhat.com>
Thanks!
Hanna