From: Stefan Hajnoczi <stefanha@redhat.com>
To: Hanna Czenczek <hreitz@redhat.com>
Cc: qemu-block@nongnu.org, qemu-devel@nongnu.org,
Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>,
Eric Blake <eblake@redhat.com>, Kevin Wolf <kwolf@redhat.com>,
Fam Zheng <fam@euphon.net>
Subject: Re: [RFC 1/2] block: Split padded I/O vectors exceeding IOV_MAX
Date: Wed, 15 Mar 2023 14:48:22 -0400
Message-ID: <20230315184822.GE16636@fedora>
In-Reply-To: <20230315121330.29679-2-hreitz@redhat.com>
On Wed, Mar 15, 2023 at 01:13:29PM +0100, Hanna Czenczek wrote:
> When processing vectored guest requests that are not aligned to the
> storage request alignment, we pad them by adding head and/or tail
> buffers for a read-modify-write cycle.
>
> The guest can submit I/O vectors up to IOV_MAX (1024) in length, but
> with this padding, the vector can exceed that limit. As of
> 4c002cef0e9abe7135d7916c51abce47f7fc1ee2 ("util/iov: make
> qemu_iovec_init_extended() honest"), we refuse to pad vectors beyond the
> limit, instead returning an error to the guest.
>
> To the guest, this appears as a random I/O error. We should not return
> an I/O error to the guest when it issued a perfectly valid request.
>
> Before 4c002cef0e9abe7135d7916c51abce47f7fc1ee2, we just made the vector
> longer than IOV_MAX, which generally seems to work (because the guest
> assumes a smaller alignment than we really have, file-posix's
> raw_co_prw() will generally see bdrv_qiov_is_aligned() return false, and
> so emulate the request, so that the IOV_MAX limit does not matter). However,
> that does not seem exactly great.
>
> I see two ways to fix this problem:
> 1. We split such long requests into two requests.
> 2. We join some elements of the vector into new buffers to make it
> shorter.
>
> I am wary of (1), because it seems like it may have unintended side
> effects.
>
> (2) on the other hand seems relatively simple to implement, with
> hopefully few side effects, so this patch does that.
Looks like a reasonable solution. I think the code is correct and I
posted ideas for making it easier to understand.
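To make sure I understand the failure mode: a guest can submit a qiov
that already has IOV_MAX == 1024 entries and is misaligned at both
ends, so bdrv_init_padding() adds a head and a tail entry and
local_qiov ends up with 1026 entries, 2 over the limit. That matches
the assertion on surplus_count below. (Worked example mine, in case it
helps other readers.)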
>
> Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=2141964
> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> ---
> block/io.c | 139 ++++++++++++++++++++++++++++++++++++++++++++++++++---
> util/iov.c | 4 --
> 2 files changed, 133 insertions(+), 10 deletions(-)
>
> diff --git a/block/io.c b/block/io.c
> index 8974d46941..ee226d23d6 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -1435,6 +1435,12 @@ out:
> * @merge_reads is true for small requests,
> * if @buf_len == @head + bytes + @tail. In this case it is possible that both
> * head and tail exist but @buf_len == align and @tail_buf == @buf.
> + *
> + * @write is true for write requests, false for read requests.
> + *
> + * If padding makes the vector too long (exceeding IOV_MAX), then we need to
> + * merge existing vector elements into a single one. @collapse_buf acts as the
> + * bounce buffer in such cases.
> */
> typedef struct BdrvRequestPadding {
> uint8_t *buf;
> @@ -1443,11 +1449,17 @@ typedef struct BdrvRequestPadding {
> size_t head;
> size_t tail;
> bool merge_reads;
> + bool write;
> QEMUIOVector local_qiov;
> +
> + uint8_t *collapse_buf;
> + size_t collapse_len;
> + QEMUIOVector collapsed_qiov;
> } BdrvRequestPadding;
>
> static bool bdrv_init_padding(BlockDriverState *bs,
> int64_t offset, int64_t bytes,
> + bool write,
> BdrvRequestPadding *pad)
> {
> int64_t align = bs->bl.request_alignment;
> @@ -1479,9 +1491,101 @@ static bool bdrv_init_padding(BlockDriverState *bs,
> pad->tail_buf = pad->buf + pad->buf_len - align;
> }
>
> + pad->write = write;
> +
> return true;
> }
>
> +/*
> + * If padding has made the IOV (`pad->local_qiov`) too long (more than IOV_MAX
> + * elements), collapse some elements into a single one so that it adheres to the
> + * IOV_MAX limit again.
> + *
> + * If collapsing, `pad->collapse_buf` will be used as a bounce buffer of length
> + * `pad->collapse_len`. `pad->collapsed_qiov` will contain the previous entries
> + * (before collapsing), so that bdrv_padding_destroy() can copy the bounce
> + * buffer content back for read requests.
The distinction between "collapse" and "collapsed" is subtle. I didn't
guess it right: I thought collapsed_qiov was a QEMUIOVector for
collapse_buf/collapse_len.
Please choose a name for collapsed_qiov that makes this clearer. Maybe
pre_collapse_qiov (i.e. the local_qiov iovecs that were replaced by
bdrv_padding_collapse)?
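In other words, something like this in the struct (a sketch with the
suggested name; comments mine):

    uint8_t *collapse_buf;          /* bounce buffer for merged elements */
    size_t collapse_len;            /* length of collapse_buf */
    QEMUIOVector pre_collapse_qiov; /* the local_qiov elements that
                                       collapse_buf replaced; read data is
                                       copied back into them on destroy */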
> + *
> + * Note that we will not touch the padding head or tail entries here. We cannot
> + * move them to a bounce buffer, because for RMWs, both head and tail expect to
> + * be in an aligned buffer with scratch space after (head) or before (tail) to
> + * perform the read into (because the whole buffer must be aligned, but head's
> + * and tail's lengths naturally cannot be aligned, because they provide padding
> + * for unaligned requests). A collapsed bounce buffer for multiple IOV elements
> + * cannot provide such scratch space.
As someone who hasn't looked at this code for a while, I don't
understand this paragraph. Can you expand on why RMW is problematic
here? If not, don't worry; it's hard to explain iov juggling.
> + *
> + * Therefore, this function collapses the first IOV elements after the
> + * (potential) head element.
> + */
> +static void bdrv_padding_collapse(BdrvRequestPadding *pad, BlockDriverState *bs)
> +{
> + int surplus_count, collapse_count;
> + struct iovec *collapse_iovs;
> + QEMUIOVector collapse_qiov;
> + size_t move_count;
> +
> + surplus_count = pad->local_qiov.niov - IOV_MAX;
> + /* Not exceeding the limit? Nothing to collapse. */
> + if (surplus_count <= 0) {
> + return;
> + }
> +
> + /*
> + * Only head and tail can have led to the number of entries exceeding
> + * IOV_MAX, so we can exceed it by the head and tail at most.
> + */
> + assert(surplus_count <= !!pad->head + !!pad->tail);
> +
> + /*
> + * We merge (collapse) `surplus_count` entries into the first entry that is
> + * not padding, i.e. we merge `surplus_count + 1` entries into entry 0 if
> + * there is no head, or entry 1 if there is one.
> + */
> + collapse_count = surplus_count + 1;
> + collapse_iovs = &pad->local_qiov.iov[pad->head ? 1 : 0];
> +
> + /* There must be no previously collapsed buffers in `pad` */
> + assert(pad->collapse_len == 0);
> + for (int i = 0; i < collapse_count; i++) {
> + pad->collapse_len += collapse_iovs[i].iov_len;
> + }
> + pad->collapse_buf = qemu_blockalign(bs, pad->collapse_len);
> +
> + /* Save the to-be-collapsed IOV elements in collapsed_qiov */
> + qemu_iovec_init_external(&collapse_qiov, collapse_iovs, collapse_count);
> + qemu_iovec_init_slice(&pad->collapsed_qiov,
Having collapse_qiov and collapsed_qiov in the same function is
confusing. IIUC, collapse_qiov and collapsed_qiov share the same
buffers, except that the last iovec in collapse_qiov has its original
length from local_qiov, whereas collapsed_qiov may shrink the last
iovec.
Maybe just naming collapse_qiov "qiov" or "tmp_qiov" would help, since
that avoids making collapse_buf/collapse_len vs collapsed_qiov even
more confusing.
> + &collapse_qiov, 0, pad->collapse_len);
> + if (pad->write) {
> + /* For writes: Copy all to-be-collapsed data into collapse_buf */
> + qemu_iovec_to_buf(&pad->collapsed_qiov, 0,
> + pad->collapse_buf, pad->collapse_len);
> + }
> +
> + /* Create the collapsed entry in pad->local_qiov */
> + collapse_iovs[0] = (struct iovec){
> + .iov_base = pad->collapse_buf,
> + .iov_len = pad->collapse_len,
> + };
> +
> + /*
> + * To finalize collapsing, we must shift the rest of pad->local_qiov left by
> + * `surplus_count`, i.e. we must move all elements after `collapse_iovs` to
> + * immediately after the collapse target.
> + *
> + * Therefore, the memmove() target is `collapse_iovs[1]` and the source is
> + * `collapse_iovs[collapse_count]`. The number of elements to move is the
> + * number of elements remaining in `pad->local_qiov` after and including
> + * `collapse_iovs[collapse_count]`.
> + */
> + move_count = &pad->local_qiov.iov[pad->local_qiov.niov] -
> + &collapse_iovs[collapse_count];
> + memmove(&collapse_iovs[1],
> + &collapse_iovs[collapse_count],
> + move_count * sizeof(pad->local_qiov.iov[0]));
> +
> + pad->local_qiov.niov -= surplus_count;
> +}
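Continuing my example from above: with a head entry present and
surplus_count == 2, local_qiov.niov == 1026, we get
collapse_iovs == &iov[1] and collapse_count == 3, so iov[1..3] are
merged into the bounce buffer placed at iov[1]. Then
move_count == &iov[1026] - &iov[4] == 1022, iov[4..1025] shift down to
iov[2..1023], and niov drops to 1024 == IOV_MAX. So the index
arithmetic checks out.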
> +
> static int coroutine_fn GRAPH_RDLOCK
> bdrv_padding_rmw_read(BdrvChild *child, BdrvTrackedRequest *req,
> BdrvRequestPadding *pad, bool zero_middle)
> @@ -1549,6 +1653,18 @@ static void bdrv_padding_destroy(BdrvRequestPadding *pad)
> qemu_vfree(pad->buf);
> qemu_iovec_destroy(&pad->local_qiov);
> }
> + if (pad->collapse_buf) {
> + if (!pad->write) {
> + /*
> + * If padding required elements in the vector to be collapsed into a
> + * bounce buffer, copy the bounce buffer content back
> + */
> + qemu_iovec_from_buf(&pad->collapsed_qiov, 0,
> + pad->collapse_buf, pad->collapse_len);
> + }
> + qemu_vfree(pad->collapse_buf);
> + qemu_iovec_destroy(&pad->collapsed_qiov);
> + }
This is safe because qemu_iovec_init_slice() takes copies of the
local_qiov iovecs instead of references, but the code requires less
thinking if collapsed_qiov is destroyed before local_qiov. This change
would be nice if you respin.
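i.e. something like this (an untested sketch of the reordering, based
on the hunk above):

    static void bdrv_padding_destroy(BdrvRequestPadding *pad)
    {
        if (pad->collapse_buf) {
            if (!pad->write) {
                /* Copy the bounce buffer content back for reads */
                qemu_iovec_from_buf(&pad->collapsed_qiov, 0,
                                    pad->collapse_buf, pad->collapse_len);
            }
            /* Destroy the slice before the iovecs it was sliced from */
            qemu_iovec_destroy(&pad->collapsed_qiov);
            qemu_vfree(pad->collapse_buf);
        }
        if (pad->buf) {
            qemu_vfree(pad->buf);
            qemu_iovec_destroy(&pad->local_qiov);
        }
        memset(pad, 0, sizeof(*pad));
    }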
> memset(pad, 0, sizeof(*pad));
> }
>
> @@ -1559,6 +1675,8 @@ static void bdrv_padding_destroy(BdrvRequestPadding *pad)
> * read of padding, bdrv_padding_rmw_read() should be called separately if
> * needed.
> *
> + * @write is true for write requests, false for read requests.
> + *
> * Request parameters (@qiov, &qiov_offset, &offset, &bytes) are in-out:
> * - on function start they represent original request
> * - on failure or when padding is not needed they are unchanged
> @@ -1567,6 +1685,7 @@ static void bdrv_padding_destroy(BdrvRequestPadding *pad)
> static int bdrv_pad_request(BlockDriverState *bs,
> QEMUIOVector **qiov, size_t *qiov_offset,
> int64_t *offset, int64_t *bytes,
> + bool write,
> BdrvRequestPadding *pad, bool *padded,
> BdrvRequestFlags *flags)
> {
> @@ -1574,7 +1693,7 @@ static int bdrv_pad_request(BlockDriverState *bs,
>
> bdrv_check_qiov_request(*offset, *bytes, *qiov, *qiov_offset, &error_abort);
>
> - if (!bdrv_init_padding(bs, *offset, *bytes, pad)) {
> + if (!bdrv_init_padding(bs, *offset, *bytes, write, pad)) {
> if (padded) {
> *padded = false;
> }
> @@ -1589,6 +1708,14 @@ static int bdrv_pad_request(BlockDriverState *bs,
> bdrv_padding_destroy(pad);
> return ret;
> }
> +
> + /*
> + * If the IOV is too long after padding, merge (collapse) some entries to
> + * make it not exceed IOV_MAX
> + */
> + bdrv_padding_collapse(pad, bs);
> + assert(pad->local_qiov.niov <= IOV_MAX);
> +
> *bytes += pad->head + pad->tail;
> *offset -= pad->head;
> *qiov = &pad->local_qiov;
> @@ -1653,8 +1780,8 @@ int coroutine_fn bdrv_co_preadv_part(BdrvChild *child,
> flags |= BDRV_REQ_COPY_ON_READ;
> }
>
> - ret = bdrv_pad_request(bs, &qiov, &qiov_offset, &offset, &bytes, &pad,
> - NULL, &flags);
> + ret = bdrv_pad_request(bs, &qiov, &qiov_offset, &offset, &bytes, false,
> + &pad, NULL, &flags);
> if (ret < 0) {
> goto fail;
> }
> @@ -1996,7 +2123,7 @@ bdrv_co_do_zero_pwritev(BdrvChild *child, int64_t offset, int64_t bytes,
> /* This flag doesn't make sense for padding or zero writes */
> flags &= ~BDRV_REQ_REGISTERED_BUF;
>
> - padding = bdrv_init_padding(bs, offset, bytes, &pad);
> + padding = bdrv_init_padding(bs, offset, bytes, true, &pad);
> if (padding) {
> assert(!(flags & BDRV_REQ_NO_WAIT));
> bdrv_make_request_serialising(req, align);
> @@ -2112,8 +2239,8 @@ int coroutine_fn bdrv_co_pwritev_part(BdrvChild *child,
> * bdrv_co_do_zero_pwritev() does aligning by itself, so, we do
> * alignment only if there is no ZERO flag.
> */
> - ret = bdrv_pad_request(bs, &qiov, &qiov_offset, &offset, &bytes, &pad,
> - &padded, &flags);
> + ret = bdrv_pad_request(bs, &qiov, &qiov_offset, &offset, &bytes, true,
> + &pad, &padded, &flags);
> if (ret < 0) {
> return ret;
> }
> diff --git a/util/iov.c b/util/iov.c
> index b4be580022..4d0d381949 100644
> --- a/util/iov.c
> +++ b/util/iov.c
> @@ -444,10 +444,6 @@ int qemu_iovec_init_extended(
> }
>
> total_niov = !!head_len + mid_niov + !!tail_len;
> - if (total_niov > IOV_MAX) {
> - return -EINVAL;
> - }
Perhaps an assertion would be good here, just to make sure a violation
is detected if a new caller is added some day. qemu_iovec_init_extended()
is a public API, so something unrelated to block layer padding might
use it.
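i.e. (sketch):

    total_niov = !!head_len + mid_niov + !!tail_len;
    assert(total_niov <= IOV_MAX);

(Assuming that can still hold for every caller after this patch; if
bdrv_pad_request() legitimately builds a longer vector here and only
collapses it afterwards, then the assertion obviously can't work and a
comment would have to do.)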
> -
> if (total_niov == 1) {
> qemu_iovec_init_buf(qiov, NULL, 0);
> p = &qiov->local_iov;
> --
> 2.39.1
>