From: Hanna Czenczek <hreitz@redhat.com>
To: Eric Blake <eblake@redhat.com>
Cc: qemu-block@nongnu.org, qemu-devel@nongnu.org,
Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>,
Kevin Wolf <kwolf@redhat.com>,
Stefan Hajnoczi <stefanha@redhat.com>, Fam Zheng <fam@euphon.net>
Subject: Re: [RFC 1/2] block: Split padded I/O vectors exceeding IOV_MAX
Date: Thu, 16 Mar 2023 10:43:38 +0100
Message-ID: <007b6adc-0c79-3d5c-92d4-108e14a36424@redhat.com>
In-Reply-To: <20230315182501.w5zed6yktlfeytlf@redhat.com>

On 15.03.23 19:25, Eric Blake wrote:
> On Wed, Mar 15, 2023 at 01:13:29PM +0100, Hanna Czenczek wrote:
>> When processing vectored guest requests that are not aligned to the
>> storage request alignment, we pad them by adding head and/or tail
>> buffers for a read-modify-write cycle.
>>
>> The guest can submit I/O vectors up to IOV_MAX (1024) in length, but
>> with this padding, the vector can exceed that limit. As of
>> 4c002cef0e9abe7135d7916c51abce47f7fc1ee2 ("util/iov: make
>> qemu_iovec_init_extended() honest"), we refuse to pad vectors beyond the
>> limit, instead returning an error to the guest.
>>
>> To the guest, this appears as a random I/O error. We should not return
>> an I/O error to the guest when it issued a perfectly valid request.
>>
>> Before 4c002cef0e9abe7135d7916c51abce47f7fc1ee2, we just made the vector
>> longer than IOV_MAX, which generally happened to work: because the guest
>> assumes a smaller alignment than we really have, file-posix's
>> raw_co_prw() will generally see bdrv_qiov_is_aligned() return false and
>> so emulate the request, so that the IOV_MAX limit does not come into
>> play. Still, that does not seem like a great thing to rely on.
>>
>> I see two ways to fix this problem:
>> 1. We split such long requests into two requests.
>> 2. We join some elements of the vector into new buffers to make it
>> shorter.
>>
>> I am wary of (1), because it seems like it may have unintended side
>> effects.
>>
>> (2) on the other hand seems relatively simple to implement, with
>> hopefully few side effects, so this patch does that.
> Agreed that approach 2 is more conservative.
>
>> Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=2141964
>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
>> ---
>> block/io.c | 139 ++++++++++++++++++++++++++++++++++++++++++++++++++---
>> util/iov.c | 4 --
>> 2 files changed, 133 insertions(+), 10 deletions(-)
>>
>> +/*
>> + * If padding has made the IOV (`pad->local_qiov`) too long (more than IOV_MAX
>> + * elements), collapse some elements into a single one so that it adheres to the
>> + * IOV_MAX limit again.
>> + *
>> + * If collapsing, `pad->collapse_buf` will be used as a bounce buffer of length
>> + * `pad->collapse_len`. `pad->collapsed_qiov` will contain the previous entries
>> + * (before collapsing), so that bdrv_padding_destroy() can copy the bounce
>> + * buffer content back for read requests.
>> + *
>> + * Note that we will not touch the padding head or tail entries here. We cannot
>> + * move them to a bounce buffer, because for RMWs, both head and tail expect to
>> + * be in an aligned buffer with scratch space after (head) or before (tail) to
>> + * perform the read into (because the whole buffer must be aligned, but head's
>> + * and tail's lengths naturally cannot be aligned, because they provide padding
>> + * for unaligned requests). A collapsed bounce buffer for multiple IOV elements
>> + * cannot provide such scratch space.
>> + *
>> + * Therefore, this function collapses the first IOV elements after the
>> + * (potential) head element.
> It looks like you blindly pick the first one or two non-padding iovs
> at the front of the array. Would it be any wiser (in terms of less
> memmove() action or even a smaller bounce buffer) to pick iovs at the
> end of the array, and/or a sequential search for the smallest
> neighboring iovs? Or is that a micro-optimization that costs more
> than it saves?
Right, I didn’t think of picking entries near the end – that indeed makes
sense!  Even if it doesn’t help performance, it at least allows dropping
the non-trivial comment explaining the memmove().
As for searching for the smallest neighboring buffers, I’m not sure.  It
could pay off performance-wise – iterating over up to 1024 elements will
probably be amortized quickly when the buffer sizes really do differ.  My
main concern is that it would make the code more complicated, and I just
don’t think that’s worth it for such a rare case.
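Just to make the “near the end” variant concrete, I imagine it would look
roughly like this (a rough sketch only – field names taken from the comment
above, `bs` assumed to be in scope, error handling omitted, direct fiddling
with the QEMUIOVector fields for brevity; not the actual patch code):

  /* Collapse the two guest entries just before the (potential) tail
   * padding entry, so none of the earlier entries have to be moved. */
  int niov = pad->local_qiov.niov;
  int tail_ents = pad->tail ? 1 : 0;
  struct iovec *a = &pad->local_qiov.iov[niov - tail_ents - 2];
  struct iovec *b = &pad->local_qiov.iov[niov - tail_ents - 1];

  pad->collapse_len = a->iov_len + b->iov_len;
  pad->collapse_buf = qemu_blockalign(bs, pad->collapse_len);

  /* Remember the original entries so read data can be copied back later */
  qemu_iovec_init(&pad->collapsed_qiov, 2);
  qemu_iovec_add(&pad->collapsed_qiov, a->iov_base, a->iov_len);
  qemu_iovec_add(&pad->collapsed_qiov, b->iov_base, b->iov_len);

  /* Copy the guest data into the bounce buffer (needed for writes;
   * harmless for reads, which overwrite it anyway) */
  qemu_iovec_to_buf(&pad->collapsed_qiov, 0,
                    pad->collapse_buf, pad->collapse_len);

  /* Replace the two entries with the single bounce-buffer entry; only
   * the tail entry (if there is one) needs to move up a slot */
  a->iov_base = pad->collapse_buf;
  a->iov_len = pad->collapse_len;
  if (tail_ents) {
      *b = pad->local_qiov.iov[niov - 1];
  }
  pad->local_qiov.niov = niov - 1;

I.e. apart from the index math the logic stays the same, we just avoid
shuffling ~1024 entries around.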
> Would it be any easier to swap the order of padding vs. collapsing?
> That is, we already know the user is giving us a long list of iovs; if
> it is 1024 elements long, and we can detect that padding will be
> needed, should we collapse before padding instead of padding, finding
> that we now have 1026, and memmove'ing back into 1024?
I’d prefer that, but it’s difficult. We need the temporary QIOV
(pad->local_qiov) so we can merge entries, but this is only created by
qemu_iovec_init_extended().
We can try to move the collapsing into qemu_iovec_init_extended(), but
it would be a bit awkward still, and probably blow up that function’s
interface (it’s in util/iov.c, so we can’t really immediately use the
BdrvRequestPadding object). I think, in the end, functionally not much
would change, so I’d rather keep the order as it is (unless someone has
a good idea here).
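To illustrate the constraint: the flow in bdrv_pad_request() is currently
roughly the following (reconstructed from memory and simplified, and the
collapse helper name is made up):

  /* Build the padded vector first: head buffer + guest entries + tail
   * buffer.  pad->local_qiov only exists after this call. */
  ret = qemu_iovec_init_extended(&pad->local_qiov,
                                 pad->buf, pad->head,
                                 *qiov, *qiov_offset, *bytes,
                                 pad->buf + pad->buf_len - pad->tail,
                                 pad->tail);
  if (ret < 0) {
      return ret;
  }

  /* Only now can we see whether padding pushed us past the limit and
   * merge entries within pad->local_qiov (hypothetical helper name): */
  if (pad->local_qiov.niov > IOV_MAX) {
      bdrv_padding_collapse(pad, bs);
  }

Collapsing before that first call would mean teaching
qemu_iovec_init_extended() (or a new variant of it) about the limit, which
is the interface blow-up I mean.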
> But logic-wise, your patch looks correct to me.
>
> Reviewed-by: Eric Blake <eblake@redhat.com>
Thanks!
Hanna