* [PATCH 0/4] block: Split padded I/O vectors exceeding IOV_MAX
@ 2023-03-17 17:50 Hanna Czenczek
2023-03-17 17:50 ` [PATCH 1/4] util/iov: Make qiov_slice() public Hanna Czenczek
From: Hanna Czenczek @ 2023-03-17 17:50 UTC (permalink / raw)
To: qemu-block
Cc: qemu-devel, Hanna Czenczek, Vladimir Sementsov-Ogievskiy,
Eric Blake, Kevin Wolf, Stefan Hajnoczi, Fam Zheng
RFC:
https://lists.nongnu.org/archive/html/qemu-block/2023-03/msg00446.html
Thanks for the feedback on the RFC! Sounds like we agree that this is
the right way to fix the bug.
Here in v1, I’ve followed Vladimir’s suggestion to inline the
functionality of qemu_iovec_init_extended() in block/io.c, which, I
think (hope), also addresses much of the feedback of Eric and Stefan.
The test is unchanged, the rest is pretty much rewritten (though it
stays the same in spirit).
Hanna Czenczek (4):
util/iov: Make qiov_slice() public
block: Split padded I/O vectors exceeding IOV_MAX
util/iov: Remove qemu_iovec_init_extended()
iotests/iov-padding: New test
include/qemu/iov.h | 8 +-
block/io.c | 153 +++++++++++++++++++++--
util/iov.c | 89 +++----------
tests/qemu-iotests/tests/iov-padding | 85 +++++++++++++
tests/qemu-iotests/tests/iov-padding.out | 59 +++++++++
5 files changed, 306 insertions(+), 88 deletions(-)
create mode 100755 tests/qemu-iotests/tests/iov-padding
create mode 100644 tests/qemu-iotests/tests/iov-padding.out
--
2.39.1
* [PATCH 1/4] util/iov: Make qiov_slice() public
2023-03-17 17:50 [PATCH 0/4] block: Split padded I/O vectors exceeding IOV_MAX Hanna Czenczek
@ 2023-03-17 17:50 ` Hanna Czenczek
2023-03-18 12:06 ` Eric Blake
2023-03-20 10:00 ` Vladimir Sementsov-Ogievskiy
2023-03-17 17:50 ` [PATCH 2/4] block: Split padded I/O vectors exceeding IOV_MAX Hanna Czenczek
From: Hanna Czenczek @ 2023-03-17 17:50 UTC (permalink / raw)
To: qemu-block
Cc: qemu-devel, Hanna Czenczek, Vladimir Sementsov-Ogievskiy,
Eric Blake, Kevin Wolf, Stefan Hajnoczi, Fam Zheng
We want to inline qemu_iovec_init_extended() in block/io.c for padding
requests, and having access to qiov_slice() is useful for this.
(We will need to count the number of I/O vector elements of a slice
there, and then later process this slice. Without qiov_slice(), we
would need to call qemu_iovec_subvec_niov(), and all further
IOV-processing functions may need to skip prefixing elements to
accommodate a qiov_offset. Because qemu_iovec_subvec_niov()
internally calls qiov_slice(), we can just have the block/io.c code call
qiov_slice() itself, thus get the number of elements, and also create an
iovec array with the superfluous prefixing elements stripped, so the
following processing functions no longer need to skip them.)
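(For illustration, a minimal sketch of the call pattern this enables;
the variable names here are hypothetical, not code from this series:

    size_t head, tail;
    int niov;
    struct iovec *iov;

    /* One call yields both the element count and an iovec array with
     * the superfluous prefixing elements already stripped off */
    iov = qemu_iovec_slice(qiov, qiov_offset, bytes, &head, &tail, &niov);
    /* iov[0] starts @head bytes in; iov[niov - 1] ends @tail bytes early */

The actual caller, bdrv_pad_request(), follows in the next patch.)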
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
include/qemu/iov.h | 3 +++
util/iov.c | 14 +++++++-------
2 files changed, 10 insertions(+), 7 deletions(-)
diff --git a/include/qemu/iov.h b/include/qemu/iov.h
index 9330746680..46fadfb27a 100644
--- a/include/qemu/iov.h
+++ b/include/qemu/iov.h
@@ -229,6 +229,9 @@ int qemu_iovec_init_extended(
void *tail_buf, size_t tail_len);
void qemu_iovec_init_slice(QEMUIOVector *qiov, QEMUIOVector *source,
size_t offset, size_t len);
+struct iovec *qemu_iovec_slice(QEMUIOVector *qiov,
+ size_t offset, size_t len,
+ size_t *head, size_t *tail, int *niov);
int qemu_iovec_subvec_niov(QEMUIOVector *qiov, size_t offset, size_t len);
void qemu_iovec_add(QEMUIOVector *qiov, void *base, size_t len);
void qemu_iovec_concat(QEMUIOVector *dst,
diff --git a/util/iov.c b/util/iov.c
index b4be580022..65a70449da 100644
--- a/util/iov.c
+++ b/util/iov.c
@@ -378,15 +378,15 @@ static struct iovec *iov_skip_offset(struct iovec *iov, size_t offset,
}
/*
- * qiov_slice
+ * qemu_iovec_slice
*
* Find subarray of iovec's, containing requested range. @head would
* be offset in first iov (returned by the function), @tail would be
* count of extra bytes in last iovec (returned iov + @niov - 1).
*/
-static struct iovec *qiov_slice(QEMUIOVector *qiov,
- size_t offset, size_t len,
- size_t *head, size_t *tail, int *niov)
+struct iovec *qemu_iovec_slice(QEMUIOVector *qiov,
+ size_t offset, size_t len,
+ size_t *head, size_t *tail, int *niov)
{
struct iovec *iov, *end_iov;
@@ -411,7 +411,7 @@ int qemu_iovec_subvec_niov(QEMUIOVector *qiov, size_t offset, size_t len)
size_t head, tail;
int niov;
- qiov_slice(qiov, offset, len, &head, &tail, &niov);
+ qemu_iovec_slice(qiov, offset, len, &head, &tail, &niov);
return niov;
}
@@ -439,8 +439,8 @@ int qemu_iovec_init_extended(
}
if (mid_len) {
- mid_iov = qiov_slice(mid_qiov, mid_offset, mid_len,
- &mid_head, &mid_tail, &mid_niov);
+ mid_iov = qemu_iovec_slice(mid_qiov, mid_offset, mid_len,
+ &mid_head, &mid_tail, &mid_niov);
}
total_niov = !!head_len + mid_niov + !!tail_len;
--
2.39.1
* [PATCH 2/4] block: Split padded I/O vectors exceeding IOV_MAX
2023-03-17 17:50 [PATCH 0/4] block: Split padded I/O vectors exceeding IOV_MAX Hanna Czenczek
2023-03-17 17:50 ` [PATCH 1/4] util/iov: Make qiov_slice() public Hanna Czenczek
@ 2023-03-17 17:50 ` Hanna Czenczek
2023-03-18 12:19 ` Eric Blake
2023-03-20 10:31 ` Vladimir Sementsov-Ogievskiy
2023-03-17 17:50 ` [PATCH 3/4] util/iov: Remove qemu_iovec_init_extended() Hanna Czenczek
From: Hanna Czenczek @ 2023-03-17 17:50 UTC (permalink / raw)
To: qemu-block
Cc: qemu-devel, Hanna Czenczek, Vladimir Sementsov-Ogievskiy,
Eric Blake, Kevin Wolf, Stefan Hajnoczi, Fam Zheng
When processing vectored guest requests that are not aligned to the
storage request alignment, we pad them by adding head and/or tail
buffers for a read-modify-write cycle.
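(A concrete example, with assumed numbers: given a 4096-byte request
alignment, a guest write at offset 512 with length 512 must be widened
to cover [0, 4096). The head padding is then the first 512 bytes and the
tail padding the last 3072 bytes of that aligned area; both are first
read from storage before the combined buffer is written back.)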
The guest can submit I/O vectors up to IOV_MAX (1024) in length, but
with this padding, the vector can exceed that limit. As of
4c002cef0e9abe7135d7916c51abce47f7fc1ee2 ("util/iov: make
qemu_iovec_init_extended() honest"), we refuse to pad vectors beyond the
limit, instead returning an error to the guest.
To the guest, this appears as a random I/O error. We should not return
an I/O error to the guest when it issued a perfectly valid request.
Before 4c002cef0e9abe7135d7916c51abce47f7fc1ee2, we just made the vector
longer than IOV_MAX, which generally seems to work (because the guest
assumes a smaller alignment than we really have, file-posix's
raw_co_prw() will generally see bdrv_qiov_is_aligned() return false, and
so emulate the request, so that IOV_MAX does not matter). However,
that does not seem exactly great.
I see two ways to fix this problem:
1. We split such long requests into two requests.
2. We join some elements of the vector into new buffers to make it
shorter.
I am wary of (1), because it seems like it may have unintended side
effects.
(2) on the other hand seems relatively simple to implement, with
hopefully few side effects, so this patch does that.
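(A worked example with assumed numbers: if the guest submits niov = 1024
elements and both head and tail padding are needed, concatenating
everything would yield padded_niov = 1 + 1024 + 1 = 1026 elements. The
surplus over IOV_MAX is 2, so the first 2 + 1 = 3 guest elements are
merged into a single bounce buffer, and the final vector has 1 (head) +
1 (bounce buffer) + 1021 (remaining elements) + 1 (tail) = 1024 =
IOV_MAX elements.)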
To do this, the use of qemu_iovec_init_extended() in bdrv_pad_request()
is effectively replaced by the new function bdrv_create_padded_qiov(),
which not only wraps the request IOV with padding head/tail, but also
ensures that the resulting vector will not have more than IOV_MAX
elements. Putting that functionality into qemu_iovec_init_extended() is
infeasible because it requires allocating a bounce buffer; doing so
would require many more parameters (buffer alignment, how to initialize
the buffer, and out parameters like the buffer, its length, and the
original elements), which is not reasonable.
Conversely, it is not difficult to move qemu_iovec_init_extended()'s
functionality into bdrv_create_padded_qiov() by using public
qemu_iovec_* functions, so that is what this patch does.
Because bdrv_pad_request() was the only "serious" user of
qemu_iovec_init_extended(), the next patch will remove the latter
function, so the functionality is not implemented twice.
Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=2141964
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
block/io.c | 153 +++++++++++++++++++++++++++++++++++++++++++++++++----
1 file changed, 143 insertions(+), 10 deletions(-)
diff --git a/block/io.c b/block/io.c
index 8974d46941..1e9cdba17a 100644
--- a/block/io.c
+++ b/block/io.c
@@ -1435,6 +1435,14 @@ out:
* @merge_reads is true for small requests,
* if @buf_len == @head + bytes + @tail. In this case it is possible that both
* head and tail exist but @buf_len == align and @tail_buf == @buf.
+ *
+ * @write is true for write requests, false for read requests.
+ *
+ * If padding makes the vector too long (exceeding IOV_MAX), then we need to
+ * merge existing vector elements into a single one. @collapse_bounce_buf acts
+ * as the bounce buffer in such cases. @pre_collapse_qiov holds the
+ * pre-collapse I/O vector elements, so that for read requests, the data can
+ * be copied back after the read is done.
*/
typedef struct BdrvRequestPadding {
uint8_t *buf;
@@ -1443,11 +1451,17 @@ typedef struct BdrvRequestPadding {
size_t head;
size_t tail;
bool merge_reads;
+ bool write;
QEMUIOVector local_qiov;
+
+ uint8_t *collapse_bounce_buf;
+ size_t collapse_len;
+ QEMUIOVector pre_collapse_qiov;
} BdrvRequestPadding;
static bool bdrv_init_padding(BlockDriverState *bs,
int64_t offset, int64_t bytes,
+ bool write,
BdrvRequestPadding *pad)
{
int64_t align = bs->bl.request_alignment;
@@ -1479,6 +1493,8 @@ static bool bdrv_init_padding(BlockDriverState *bs,
pad->tail_buf = pad->buf + pad->buf_len - align;
}
+ pad->write = write;
+
return true;
}
@@ -1545,6 +1561,18 @@ zero_mem:
static void bdrv_padding_destroy(BdrvRequestPadding *pad)
{
+ if (pad->collapse_bounce_buf) {
+ if (!pad->write) {
+ /*
+ * If padding required elements in the vector to be collapsed into a
+ * bounce buffer, copy the bounce buffer content back
+ */
+ qemu_iovec_from_buf(&pad->pre_collapse_qiov, 0,
+ pad->collapse_bounce_buf, pad->collapse_len);
+ }
+ qemu_vfree(pad->collapse_bounce_buf);
+ qemu_iovec_destroy(&pad->pre_collapse_qiov);
+ }
if (pad->buf) {
qemu_vfree(pad->buf);
qemu_iovec_destroy(&pad->local_qiov);
@@ -1552,6 +1580,101 @@ static void bdrv_padding_destroy(BdrvRequestPadding *pad)
memset(pad, 0, sizeof(*pad));
}
+/*
+ * Create pad->local_qiov by wrapping @iov in the padding head and tail, while
+ * ensuring that the resulting vector will not exceed IOV_MAX elements.
+ *
+ * To ensure this, when necessary, the first couple of elements (up to three)
+ * of @iov are merged into pad->collapse_bounce_buf and replaced by a reference
+ * to that bounce buffer in pad->local_qiov.
+ *
+ * After performing a read request, the data from the bounce buffer must be
+ * copied back into pad->pre_collapse_qiov (e.g. by bdrv_padding_destroy()).
+ */
+static int bdrv_create_padded_qiov(BlockDriverState *bs,
+ BdrvRequestPadding *pad,
+ struct iovec *iov, int niov,
+ size_t iov_offset, size_t bytes)
+{
+ int padded_niov, surplus_count, collapse_count;
+
+ /* Assert this invariant */
+ assert(niov <= IOV_MAX);
+
+ /*
+ * Cannot pad if resulting length would exceed SIZE_MAX. Returning an error
+ * to the guest is not ideal, but there is little else we can do. At least
+ * this will practically never happen on 64-bit systems.
+ */
+ if (SIZE_MAX - pad->head < bytes ||
+ SIZE_MAX - pad->head - bytes < pad->tail)
+ {
+ return -EINVAL;
+ }
+
+ /* Length of the resulting IOV if we just concatenated everything */
+ padded_niov = !!pad->head + niov + !!pad->tail;
+
+ qemu_iovec_init(&pad->local_qiov, MIN(padded_niov, IOV_MAX));
+
+ if (pad->head) {
+ qemu_iovec_add(&pad->local_qiov, pad->buf, pad->head);
+ }
+
+ /*
+ * If padded_niov > IOV_MAX, we cannot just concatenate everything.
+ * Instead, merge the first couple of elements of @iov to reduce the number
+ * of vector elements as necessary.
+ */
+ if (padded_niov > IOV_MAX) {
+ /*
+ * Only head and tail can have led to the number of entries exceeding
+ * IOV_MAX, so we can exceed it by the head and tail at most. We need
+ * to reduce the number of elements by `surplus_count`, so we merge that
+ * many elements plus one into one element.
+ */
+ surplus_count = padded_niov - IOV_MAX;
+ assert(surplus_count <= !!pad->head + !!pad->tail);
+ collapse_count = surplus_count + 1;
+
+ /*
+ * Move the elements to collapse into `pad->pre_collapse_qiov`, then
+ * advance `iov` (and associated variables) by those elements.
+ */
+ qemu_iovec_init(&pad->pre_collapse_qiov, collapse_count);
+ qemu_iovec_concat_iov(&pad->pre_collapse_qiov, iov,
+ collapse_count, iov_offset, SIZE_MAX);
+ iov += collapse_count;
+ iov_offset = 0;
+ niov -= collapse_count;
+ bytes -= pad->pre_collapse_qiov.size;
+
+ /*
+ * Construct the bounce buffer to match the length of the to-collapse
+ * vector elements, and for write requests, initialize it with the data
+ * from those elements. Then add it to `pad->local_qiov`.
+ */
+ pad->collapse_len = pad->pre_collapse_qiov.size;
+ pad->collapse_bounce_buf = qemu_blockalign(bs, pad->collapse_len);
+ if (pad->write) {
+ qemu_iovec_to_buf(&pad->pre_collapse_qiov, 0,
+ pad->collapse_bounce_buf, pad->collapse_len);
+ }
+ qemu_iovec_add(&pad->local_qiov,
+ pad->collapse_bounce_buf, pad->collapse_len);
+ }
+
+ qemu_iovec_concat_iov(&pad->local_qiov, iov, niov, iov_offset, bytes);
+
+ if (pad->tail) {
+ qemu_iovec_add(&pad->local_qiov,
+ pad->buf + pad->buf_len - pad->tail, pad->tail);
+ }
+
+ assert(pad->local_qiov.niov == MIN(padded_niov, IOV_MAX));
+ return 0;
+}
+
/*
* bdrv_pad_request
*
@@ -1559,6 +1682,8 @@ static void bdrv_padding_destroy(BdrvRequestPadding *pad)
* read of padding, bdrv_padding_rmw_read() should be called separately if
* needed.
*
+ * @write is true for write requests, false for read requests.
+ *
* Request parameters (@qiov, &qiov_offset, &offset, &bytes) are in-out:
* - on function start they represent original request
* - on failure or when padding is not needed they are unchanged
@@ -1567,24 +1692,32 @@ static void bdrv_padding_destroy(BdrvRequestPadding *pad)
static int bdrv_pad_request(BlockDriverState *bs,
QEMUIOVector **qiov, size_t *qiov_offset,
int64_t *offset, int64_t *bytes,
+ bool write,
BdrvRequestPadding *pad, bool *padded,
BdrvRequestFlags *flags)
{
int ret;
+ struct iovec *sliced_iov;
+ int sliced_niov;
+ size_t sliced_head, sliced_tail;
bdrv_check_qiov_request(*offset, *bytes, *qiov, *qiov_offset, &error_abort);
- if (!bdrv_init_padding(bs, *offset, *bytes, pad)) {
+ if (!bdrv_init_padding(bs, *offset, *bytes, write, pad)) {
if (padded) {
*padded = false;
}
return 0;
}
- ret = qemu_iovec_init_extended(&pad->local_qiov, pad->buf, pad->head,
- *qiov, *qiov_offset, *bytes,
- pad->buf + pad->buf_len - pad->tail,
- pad->tail);
+ sliced_iov = qemu_iovec_slice(*qiov, *qiov_offset, *bytes,
+ &sliced_head, &sliced_tail,
+ &sliced_niov);
+
+ /* Guaranteed by bdrv_check_qiov_request() */
+ assert(*bytes <= SIZE_MAX);
+ ret = bdrv_create_padded_qiov(bs, pad, sliced_iov, sliced_niov,
+ sliced_head, *bytes);
if (ret < 0) {
bdrv_padding_destroy(pad);
return ret;
@@ -1653,8 +1786,8 @@ int coroutine_fn bdrv_co_preadv_part(BdrvChild *child,
flags |= BDRV_REQ_COPY_ON_READ;
}
- ret = bdrv_pad_request(bs, &qiov, &qiov_offset, &offset, &bytes, &pad,
- NULL, &flags);
+ ret = bdrv_pad_request(bs, &qiov, &qiov_offset, &offset, &bytes, false,
+ &pad, NULL, &flags);
if (ret < 0) {
goto fail;
}
@@ -1996,7 +2129,7 @@ bdrv_co_do_zero_pwritev(BdrvChild *child, int64_t offset, int64_t bytes,
/* This flag doesn't make sense for padding or zero writes */
flags &= ~BDRV_REQ_REGISTERED_BUF;
- padding = bdrv_init_padding(bs, offset, bytes, &pad);
+ padding = bdrv_init_padding(bs, offset, bytes, true, &pad);
if (padding) {
assert(!(flags & BDRV_REQ_NO_WAIT));
bdrv_make_request_serialising(req, align);
@@ -2112,8 +2245,8 @@ int coroutine_fn bdrv_co_pwritev_part(BdrvChild *child,
* bdrv_co_do_zero_pwritev() does aligning by itself, so, we do
* alignment only if there is no ZERO flag.
*/
- ret = bdrv_pad_request(bs, &qiov, &qiov_offset, &offset, &bytes, &pad,
- &padded, &flags);
+ ret = bdrv_pad_request(bs, &qiov, &qiov_offset, &offset, &bytes, true,
+ &pad, &padded, &flags);
if (ret < 0) {
return ret;
}
--
2.39.1
* [PATCH 3/4] util/iov: Remove qemu_iovec_init_extended()
2023-03-17 17:50 [PATCH 0/4] block: Split padded I/O vectors exceeding IOV_MAX Hanna Czenczek
2023-03-17 17:50 ` [PATCH 1/4] util/iov: Make qiov_slice() public Hanna Czenczek
2023-03-17 17:50 ` [PATCH 2/4] block: Split padded I/O vectors exceeding IOV_MAX Hanna Czenczek
@ 2023-03-17 17:50 ` Hanna Czenczek
2023-03-18 12:22 ` Eric Blake
2023-03-20 11:59 ` Vladimir Sementsov-Ogievskiy
2023-03-17 17:50 ` [PATCH 4/4] iotests/iov-padding: New test Hanna Czenczek
2023-04-05 19:44 ` [PATCH 0/4] block: Split padded I/O vectors exceeding IOV_MAX Stefan Hajnoczi
From: Hanna Czenczek @ 2023-03-17 17:50 UTC (permalink / raw)
To: qemu-block
Cc: qemu-devel, Hanna Czenczek, Vladimir Sementsov-Ogievskiy,
Eric Blake, Kevin Wolf, Stefan Hajnoczi, Fam Zheng
bdrv_pad_request() was the main user of qemu_iovec_init_extended().
HEAD^ has removed that use, so we can remove qemu_iovec_init_extended()
now.
The only remaining user is qemu_iovec_init_slice(), which can easily
inline the small part it really needs.
Note that qemu_iovec_init_extended() offered a memcpy() optimization to
initialize the new I/O vector. qemu_iovec_concat_iov(), which is used
to replace its functionality, does not, but calls qemu_iovec_add() for
every single element. If we decide this optimization was important, we
will need to re-implement it in qemu_iovec_concat_iov(), which might
also benefit its pre-existing users.
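(For reference, a rough sketch of the difference; illustrative, not
verbatim util/iov.c code. The removed optimization copied the sliced
middle in one go:

    memcpy(p, mid_iov, mid_niov * sizeof(*p));
    /* then adjust p[0] and p[mid_niov - 1] for the head/tail offsets */

whereas qemu_iovec_concat_iov() performs one qemu_iovec_add() call per
source element, each of which may have to grow the destination's iovec
array.)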
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
include/qemu/iov.h | 5 ---
util/iov.c | 79 +++++++---------------------------------------
2 files changed, 11 insertions(+), 73 deletions(-)
diff --git a/include/qemu/iov.h b/include/qemu/iov.h
index 46fadfb27a..63a1c01965 100644
--- a/include/qemu/iov.h
+++ b/include/qemu/iov.h
@@ -222,11 +222,6 @@ static inline void *qemu_iovec_buf(QEMUIOVector *qiov)
void qemu_iovec_init(QEMUIOVector *qiov, int alloc_hint);
void qemu_iovec_init_external(QEMUIOVector *qiov, struct iovec *iov, int niov);
-int qemu_iovec_init_extended(
- QEMUIOVector *qiov,
- void *head_buf, size_t head_len,
- QEMUIOVector *mid_qiov, size_t mid_offset, size_t mid_len,
- void *tail_buf, size_t tail_len);
void qemu_iovec_init_slice(QEMUIOVector *qiov, QEMUIOVector *source,
size_t offset, size_t len);
struct iovec *qemu_iovec_slice(QEMUIOVector *qiov,
diff --git a/util/iov.c b/util/iov.c
index 65a70449da..866fb577f3 100644
--- a/util/iov.c
+++ b/util/iov.c
@@ -416,70 +416,6 @@ int qemu_iovec_subvec_niov(QEMUIOVector *qiov, size_t offset, size_t len)
return niov;
}
-/*
- * Compile new iovec, combining @head_buf buffer, sub-qiov of @mid_qiov,
- * and @tail_buf buffer into new qiov.
- */
-int qemu_iovec_init_extended(
- QEMUIOVector *qiov,
- void *head_buf, size_t head_len,
- QEMUIOVector *mid_qiov, size_t mid_offset, size_t mid_len,
- void *tail_buf, size_t tail_len)
-{
- size_t mid_head, mid_tail;
- int total_niov, mid_niov = 0;
- struct iovec *p, *mid_iov = NULL;
-
- assert(mid_qiov->niov <= IOV_MAX);
-
- if (SIZE_MAX - head_len < mid_len ||
- SIZE_MAX - head_len - mid_len < tail_len)
- {
- return -EINVAL;
- }
-
- if (mid_len) {
- mid_iov = qemu_iovec_slice(mid_qiov, mid_offset, mid_len,
- &mid_head, &mid_tail, &mid_niov);
- }
-
- total_niov = !!head_len + mid_niov + !!tail_len;
- if (total_niov > IOV_MAX) {
- return -EINVAL;
- }
-
- if (total_niov == 1) {
- qemu_iovec_init_buf(qiov, NULL, 0);
- p = &qiov->local_iov;
- } else {
- qiov->niov = qiov->nalloc = total_niov;
- qiov->size = head_len + mid_len + tail_len;
- p = qiov->iov = g_new(struct iovec, qiov->niov);
- }
-
- if (head_len) {
- p->iov_base = head_buf;
- p->iov_len = head_len;
- p++;
- }
-
- assert(!mid_niov == !mid_len);
- if (mid_niov) {
- memcpy(p, mid_iov, mid_niov * sizeof(*p));
- p[0].iov_base = (uint8_t *)p[0].iov_base + mid_head;
- p[0].iov_len -= mid_head;
- p[mid_niov - 1].iov_len -= mid_tail;
- p += mid_niov;
- }
-
- if (tail_len) {
- p->iov_base = tail_buf;
- p->iov_len = tail_len;
- }
-
- return 0;
-}
-
/*
* Check if the contents of subrange of qiov data is all zeroes.
*/
@@ -511,14 +447,21 @@ bool qemu_iovec_is_zero(QEMUIOVector *qiov, size_t offset, size_t bytes)
void qemu_iovec_init_slice(QEMUIOVector *qiov, QEMUIOVector *source,
size_t offset, size_t len)
{
- int ret;
+ struct iovec *slice_iov;
+ int slice_niov;
+ size_t slice_head, slice_tail;
assert(source->size >= len);
assert(source->size - len >= offset);
- /* We shrink the request, so we can't overflow neither size_t nor MAX_IOV */
- ret = qemu_iovec_init_extended(qiov, NULL, 0, source, offset, len, NULL, 0);
- assert(ret == 0);
+ slice_iov = qemu_iovec_slice(source, offset, len,
+ &slice_head, &slice_tail, &slice_niov);
+ if (slice_niov == 1) {
+ qemu_iovec_init_buf(qiov, slice_iov[0].iov_base + slice_head, len);
+ } else {
+ qemu_iovec_init(qiov, slice_niov);
+ qemu_iovec_concat_iov(qiov, slice_iov, slice_niov, slice_head, len);
+ }
}
void qemu_iovec_destroy(QEMUIOVector *qiov)
--
2.39.1
* [PATCH 4/4] iotests/iov-padding: New test
2023-03-17 17:50 [PATCH 0/4] block: Split padded I/O vectors exceeding IOV_MAX Hanna Czenczek
2023-03-17 17:50 ` [PATCH 3/4] util/iov: Remove qemu_iovec_init_extended() Hanna Czenczek
@ 2023-03-17 17:50 ` Hanna Czenczek
2023-03-18 12:39 ` Eric Blake
2023-03-20 12:12 ` Vladimir Sementsov-Ogievskiy
2023-04-05 19:44 ` [PATCH 0/4] block: Split padded I/O vectors exceeding IOV_MAX Stefan Hajnoczi
From: Hanna Czenczek @ 2023-03-17 17:50 UTC (permalink / raw)
To: qemu-block
Cc: qemu-devel, Hanna Czenczek, Vladimir Sementsov-Ogievskiy,
Eric Blake, Kevin Wolf, Stefan Hajnoczi, Fam Zheng
Test that even vectored IO requests with 1024 vector elements that are
not aligned to the device's request alignment will succeed.
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
tests/qemu-iotests/tests/iov-padding | 85 ++++++++++++++++++++++++
tests/qemu-iotests/tests/iov-padding.out | 59 ++++++++++++++++
2 files changed, 144 insertions(+)
create mode 100755 tests/qemu-iotests/tests/iov-padding
create mode 100644 tests/qemu-iotests/tests/iov-padding.out
diff --git a/tests/qemu-iotests/tests/iov-padding b/tests/qemu-iotests/tests/iov-padding
new file mode 100755
index 0000000000..b9604900c7
--- /dev/null
+++ b/tests/qemu-iotests/tests/iov-padding
@@ -0,0 +1,85 @@
+#!/usr/bin/env bash
+# group: rw quick
+#
+# Check the interaction of request padding (to fit alignment restrictions) with
+# vectored I/O from the guest
+#
+# Copyright Red Hat
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program. If not, see <http://www.gnu.org/licenses/>.
+#
+
+seq=$(basename $0)
+echo "QA output created by $seq"
+
+status=1 # failure is the default!
+
+_cleanup()
+{
+ _cleanup_test_img
+}
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+# get standard environment, filters and checks
+cd ..
+. ./common.rc
+. ./common.filter
+
+_supported_fmt raw
+_supported_proto file
+
+_make_test_img 1M
+
+IMGSPEC="driver=blkdebug,align=4096,image.driver=file,image.filename=$TEST_IMG"
+
+# Four combinations:
+# - Offset 4096, length 1023 * 512 + 512: Fully aligned to 4k
+# - Offset 4096, length 1023 * 512 + 4096: Head is aligned, tail is not
+# - Offset 512, length 1023 * 512 + 512: Neither head nor tail are aligned
+# - Offset 512, length 1023 * 512 + 4096: Tail is aligned, head is not
+for start_offset in 4096 512; do
+ for last_element_length in 512 4096; do
+ length=$((1023 * 512 + $last_element_length))
+
+ echo
+ echo "== performing 1024-element vectored requests to image (offset: $start_offset; length: $length) =="
+
+ # Fill with data for testing
+ $QEMU_IO -c 'write -P 1 0 1M' "$TEST_IMG" | _filter_qemu_io
+
+ # 1023 512-byte buffers, and then one with length $last_element_length
+ cmd_params="-P 2 $start_offset $(yes 512 | head -n 1023 | tr '\n' ' ') $last_element_length"
+ QEMU_IO_OPTIONS="$QEMU_IO_OPTIONS_NO_FMT" $QEMU_IO \
+ -c "writev $cmd_params" \
+ --image-opts \
+ "$IMGSPEC" \
+ | _filter_qemu_io
+
+ # Read all patterns -- read the part we just wrote with writev twice,
+ # once "normally", and once with a readv, so we see that that works, too
+ QEMU_IO_OPTIONS="$QEMU_IO_OPTIONS_NO_FMT" $QEMU_IO \
+ -c "read -P 1 0 $start_offset" \
+ -c "read -P 2 $start_offset $length" \
+ -c "readv $cmd_params" \
+ -c "read -P 1 $((start_offset + length)) $((1024 * 1024 - length - start_offset))" \
+ --image-opts \
+ "$IMGSPEC" \
+ | _filter_qemu_io
+ done
+done
+
+# success, all done
+echo "*** done"
+rm -f $seq.full
+status=0
diff --git a/tests/qemu-iotests/tests/iov-padding.out b/tests/qemu-iotests/tests/iov-padding.out
new file mode 100644
index 0000000000..e07a91fac7
--- /dev/null
+++ b/tests/qemu-iotests/tests/iov-padding.out
@@ -0,0 +1,59 @@
+QA output created by iov-padding
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1048576
+
+== performing 1024-element vectored requests to image (offset: 4096; length: 524288) ==
+wrote 1048576/1048576 bytes at offset 0
+1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 524288/524288 bytes at offset 4096
+512 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 4096/4096 bytes at offset 0
+4 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 524288/524288 bytes at offset 4096
+512 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 524288/524288 bytes at offset 4096
+512 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 520192/520192 bytes at offset 528384
+508 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+== performing 1024-element vectored requests to image (offset: 4096; length: 527872) ==
+wrote 1048576/1048576 bytes at offset 0
+1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 527872/527872 bytes at offset 4096
+515.500 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 4096/4096 bytes at offset 0
+4 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 527872/527872 bytes at offset 4096
+515.500 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 527872/527872 bytes at offset 4096
+515.500 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 516608/516608 bytes at offset 531968
+504.500 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+== performing 1024-element vectored requests to image (offset: 512; length: 524288) ==
+wrote 1048576/1048576 bytes at offset 0
+1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 524288/524288 bytes at offset 512
+512 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 512/512 bytes at offset 0
+512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 524288/524288 bytes at offset 512
+512 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 524288/524288 bytes at offset 512
+512 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 523776/523776 bytes at offset 524800
+511.500 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+== performing 1024-element vectored requests to image (offset: 512; length: 527872) ==
+wrote 1048576/1048576 bytes at offset 0
+1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 527872/527872 bytes at offset 512
+515.500 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 512/512 bytes at offset 0
+512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 527872/527872 bytes at offset 512
+515.500 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 527872/527872 bytes at offset 512
+515.500 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 520192/520192 bytes at offset 528384
+508 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+*** done
--
2.39.1
* Re: [PATCH 1/4] util/iov: Make qiov_slice() public
2023-03-17 17:50 ` [PATCH 1/4] util/iov: Make qiov_slice() public Hanna Czenczek
@ 2023-03-18 12:06 ` Eric Blake
2023-03-20 10:00 ` Vladimir Sementsov-Ogievskiy
From: Eric Blake @ 2023-03-18 12:06 UTC (permalink / raw)
To: Hanna Czenczek
Cc: qemu-block, qemu-devel, Vladimir Sementsov-Ogievskiy, Kevin Wolf,
Stefan Hajnoczi, Fam Zheng
On Fri, Mar 17, 2023 at 06:50:16PM +0100, Hanna Czenczek wrote:
> We want to inline qemu_iovec_init_extended() in block/io.c for padding
> requests, and having access to qiov_slice() is useful for this.
>
> (We will need to count the number of I/O vector elements of a slice
> there, and then later process this slice. Without qiov_slice(), we
> would need to call qemu_iovec_subvec_niov(), and all further
> IOV-processing functions may need to skip prefixing elements to
> accommodate a qiov_offset. Because qemu_iovec_subvec_niov()
> internally calls qiov_slice(), we can just have the block/io.c code call
> qiov_slice() itself, thus get the number of elements, and also create an
> iovec array with the superfluous prefixing elements stripped, so the
> following processing functions no longer need to skip them.)
Might be worth mentioning in the commit message that you also renamed
it to qemu_iovec_slice() as part of exporting.
>
> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> ---
> include/qemu/iov.h | 3 +++
> util/iov.c | 14 +++++++-------
> 2 files changed, 10 insertions(+), 7 deletions(-)
Reviewed-by: Eric Blake <eblake@redhat.com>
--
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3266
Virtualization: qemu.org | libvirt.org
* Re: [PATCH 2/4] block: Split padded I/O vectors exceeding IOV_MAX
2023-03-17 17:50 ` [PATCH 2/4] block: Split padded I/O vectors exceeding IOV_MAX Hanna Czenczek
@ 2023-03-18 12:19 ` Eric Blake
2023-03-20 10:31 ` Vladimir Sementsov-Ogievskiy
From: Eric Blake @ 2023-03-18 12:19 UTC (permalink / raw)
To: Hanna Czenczek
Cc: qemu-block, qemu-devel, Vladimir Sementsov-Ogievskiy, Kevin Wolf,
Stefan Hajnoczi, Fam Zheng
On Fri, Mar 17, 2023 at 06:50:17PM +0100, Hanna Czenczek wrote:
> When processing vectored guest requests that are not aligned to the
> storage request alignment, we pad them by adding head and/or tail
> buffers for a read-modify-write cycle.
>
>
> To do this, the use of qemu_iovec_init_extended() in bdrv_pad_request()
> is effectively replaced by the new function bdrv_create_padded_qiov(),
> which not only wraps the request IOV with padding head/tail, but also
> ensures that the resulting vector will not have more than IOV_MAX
> elements. Putting that functionality into qemu_iovec_init_extended() is
> infeasible because it requires allocating a bounce buffer; doing so
> would require many more parameters (buffer alignment, how to initialize
> the buffer, and out parameters like the buffer, its length, and the
> original elements), which is not reasonable.
>
> Conversely, it is not difficult to move qemu_iovec_init_extended()'s
> functionality into bdrv_create_padded_qiov() by using public
> qemu_iovec_* functions, so that is what this patch does.
>
> Because bdrv_pad_request() was the only "serious" user of
> qemu_iovec_init_extended(), the next patch will remove the latter
> function, so the functionality is not implemented twice.
>
> Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=2141964
> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> ---
> block/io.c | 153 +++++++++++++++++++++++++++++++++++++++++++++++++----
> 1 file changed, 143 insertions(+), 10 deletions(-)
Reviewed-by: Eric Blake <eblake@redhat.com>
--
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3266
Virtualization: qemu.org | libvirt.org
* Re: [PATCH 3/4] util/iov: Remove qemu_iovec_init_extended()
2023-03-17 17:50 ` [PATCH 3/4] util/iov: Remove qemu_iovec_init_extended() Hanna Czenczek
@ 2023-03-18 12:22 ` Eric Blake
2023-03-20 11:59 ` Vladimir Sementsov-Ogievskiy
From: Eric Blake @ 2023-03-18 12:22 UTC (permalink / raw)
To: Hanna Czenczek
Cc: qemu-block, qemu-devel, Vladimir Sementsov-Ogievskiy, Kevin Wolf,
Stefan Hajnoczi, Fam Zheng
On Fri, Mar 17, 2023 at 06:50:18PM +0100, Hanna Czenczek wrote:
> bdrv_pad_request() was the main user of qemu_iovec_init_extended().
> HEAD^ has removed that use, so we can remove qemu_iovec_init_extended()
> now.
>
> The only remaining user is qemu_iovec_init_slice(), which can easily
> inline the small part it really needs.
>
> Note that qemu_iovec_init_extended() offered a memcpy() optimization to
> initialize the new I/O vector. qemu_iovec_concat_iov(), which is used
> to replace its functionality, does not, but calls qemu_iovec_add() for
> every single element. If we decide this optimization was important, we
> will need to re-implement it in qemu_iovec_concat_iov(), which might
> also benefit its pre-existing users.
>
> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> ---
> include/qemu/iov.h | 5 ---
> util/iov.c | 79 +++++++---------------------------------------
> 2 files changed, 11 insertions(+), 73 deletions(-)
>
Reviewed-by: Eric Blake <eblake@redhat.com>
--
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3266
Virtualization: qemu.org | libvirt.org
* Re: [PATCH 4/4] iotests/iov-padding: New test
2023-03-17 17:50 ` [PATCH 4/4] iotests/iov-padding: New test Hanna Czenczek
@ 2023-03-18 12:39 ` Eric Blake
2023-04-03 14:23 ` Hanna Czenczek
2023-03-20 12:12 ` Vladimir Sementsov-Ogievskiy
From: Eric Blake @ 2023-03-18 12:39 UTC (permalink / raw)
To: Hanna Czenczek
Cc: qemu-block, qemu-devel, Vladimir Sementsov-Ogievskiy, Kevin Wolf,
Stefan Hajnoczi, Fam Zheng
On Fri, Mar 17, 2023 at 06:50:19PM +0100, Hanna Czenczek wrote:
> Test that even vectored IO requests with 1024 vector elements that are
> not aligned to the device's request alignment will succeed.
>
> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> ---
> +
> +# Four combinations:
> +# - Offset 4096, length 1023 * 512 + 512: Fully aligned to 4k
> +# - Offset 4096, length 1023 * 512 + 4096: Head is aligned, tail is not
Making sure I understand - we don't care if intermediate iovs are
smaller than the alignment requirement, because the scatter-gather
effects will pool it together and our overall request is still
aligned. Correct?
Reviewed-by: Eric Blake <eblake@redhat.com>
--
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3266
Virtualization: qemu.org | libvirt.org
* Re: [PATCH 1/4] util/iov: Make qiov_slice() public
2023-03-17 17:50 ` [PATCH 1/4] util/iov: Make qiov_slice() public Hanna Czenczek
2023-03-18 12:06 ` Eric Blake
@ 2023-03-20 10:00 ` Vladimir Sementsov-Ogievskiy
From: Vladimir Sementsov-Ogievskiy @ 2023-03-20 10:00 UTC (permalink / raw)
To: Hanna Czenczek, qemu-block
Cc: qemu-devel, Eric Blake, Kevin Wolf, Stefan Hajnoczi, Fam Zheng
On 17.03.23 20:50, Hanna Czenczek wrote:
> We want to inline qemu_iovec_init_extended() in block/io.c for padding
> requests, and having access to qiov_slice() is useful for this.
>
> (We will need to count the number of I/O vector elements of a slice
> there, and then later process this slice. Without qiov_slice(), we
> would need to call qemu_iovec_subvec_niov(), and all further
> IOV-processing functions may need to skip prefixing elements to
> accommodate a qiov_offset. Because qemu_iovec_subvec_niov()
> internally calls qiov_slice(), we can just have the block/io.c code call
> qiov_slice() itself, thus get the number of elements, and also create an
> iovec array with the superfluous prefixing elements stripped, so the
> following processing functions no longer need to skip them.)
>
> Signed-off-by: Hanna Czenczek<hreitz@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
--
Best regards,
Vladimir
* Re: [PATCH 2/4] block: Split padded I/O vectors exceeding IOV_MAX
2023-03-17 17:50 ` [PATCH 2/4] block: Split padded I/O vectors exceeding IOV_MAX Hanna Czenczek
2023-03-18 12:19 ` Eric Blake
@ 2023-03-20 10:31 ` Vladimir Sementsov-Ogievskiy
2023-04-03 13:33 ` Hanna Czenczek
From: Vladimir Sementsov-Ogievskiy @ 2023-03-20 10:31 UTC (permalink / raw)
To: Hanna Czenczek, qemu-block
Cc: qemu-devel, Eric Blake, Kevin Wolf, Stefan Hajnoczi, Fam Zheng
On 17.03.23 20:50, Hanna Czenczek wrote:
> When processing vectored guest requests that are not aligned to the
> storage request alignment, we pad them by adding head and/or tail
> buffers for a read-modify-write cycle.
>
> The guest can submit I/O vectors up to IOV_MAX (1024) in length, but
> with this padding, the vector can exceed that limit. As of
> 4c002cef0e9abe7135d7916c51abce47f7fc1ee2 ("util/iov: make
> qemu_iovec_init_extended() honest"), we refuse to pad vectors beyond the
> limit, instead returning an error to the guest.
>
> To the guest, this appears as a random I/O error. We should not return
> an I/O error to the guest when it issued a perfectly valid request.
>
> Before 4c002cef0e9abe7135d7916c51abce47f7fc1ee2, we just made the vector
> longer than IOV_MAX, which generally seems to work (because the guest
> assumes a smaller alignment than we really have, file-posix's
> raw_co_prw() will generally see bdrv_qiov_is_aligned() return false, and
> so emulate the request, so that the IOV_MAX does not matter). However,
> that does not seem exactly great.
>
> I see two ways to fix this problem:
> 1. We split such long requests into two requests.
> 2. We join some elements of the vector into new buffers to make it
> shorter.
>
> I am wary of (1), because it seems like it may have unintended side
> effects.
>
> (2) on the other hand seems relatively simple to implement, with
> hopefully few side effects, so this patch does that.
>
> To do this, the use of qemu_iovec_init_extended() in bdrv_pad_request()
> is effectively replaced by the new function bdrv_create_padded_qiov(),
> which not only wraps the request IOV with padding head/tail, but also
> ensures that the resulting vector will not have more than IOV_MAX
> elements. Putting that functionality into qemu_iovec_init_extended() is
> infeasible because it requires allocating a bounce buffer; doing so
> would require many more parameters (buffer alignment, how to initialize
> the buffer, and out parameters like the buffer, its length, and the
> original elements), which is not reasonable.
>
> Conversely, it is not difficult to move qemu_iovec_init_extended()'s
> functionality into bdrv_create_padded_qiov() by using public
> qemu_iovec_* functions, so that is what this patch does.
>
> Because bdrv_pad_request() was the only "serious" user of
> qemu_iovec_init_extended(), the next patch will remove the latter
> function, so the functionality is not implemented twice.
>
> Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=2141964
> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> ---
> block/io.c | 153 +++++++++++++++++++++++++++++++++++++++++++++++++----
> 1 file changed, 143 insertions(+), 10 deletions(-)
>
> diff --git a/block/io.c b/block/io.c
> index 8974d46941..1e9cdba17a 100644
> --- a/block/io.c
> +++ b/block/io.c
[..]
> + pad->write = write;
> +
> return true;
> }
>
> @@ -1545,6 +1561,18 @@ zero_mem:
>
> static void bdrv_padding_destroy(BdrvRequestPadding *pad)
Maybe, rename to _finalize, to stress that it's not only freeing memory.
> {
> + if (pad->collapse_bounce_buf) {
> + if (!pad->write) {
> + /*
> + * If padding required elements in the vector to be collapsed into a
> + * bounce buffer, copy the bounce buffer content back
> + */
> + qemu_iovec_from_buf(&pad->pre_collapse_qiov, 0,
> + pad->collapse_bounce_buf, pad->collapse_len);
> + }
> + qemu_vfree(pad->collapse_bounce_buf);
> + qemu_iovec_destroy(&pad->pre_collapse_qiov);
> + }
> if (pad->buf) {
> qemu_vfree(pad->buf);
> qemu_iovec_destroy(&pad->local_qiov);
> @@ -1552,6 +1580,101 @@ static void bdrv_padding_destroy(BdrvRequestPadding *pad)
> memset(pad, 0, sizeof(*pad));
> }
>
> +/*
> + * Create pad->local_qiov by wrapping @iov in the padding head and tail, while
> + * ensuring that the resulting vector will not exceed IOV_MAX elements.
> + *
> + * To ensure this, when necessary, the first couple of elements (up to three)
maybe, "first two-three elements"
> + * of @iov are merged into pad->collapse_bounce_buf and replaced by a reference
> + * to that bounce buffer in pad->local_qiov.
> + *
> + * After performing a read request, the data from the bounce buffer must be
> + * copied back into pad->pre_collapse_qiov (e.g. by bdrv_padding_destroy()).
> + */
> +static int bdrv_create_padded_qiov(BlockDriverState *bs,
> + BdrvRequestPadding *pad,
> + struct iovec *iov, int niov,
> + size_t iov_offset, size_t bytes)
> +{
> + int padded_niov, surplus_count, collapse_count;
> +
> + /* Assert this invariant */
> + assert(niov <= IOV_MAX);
> +
> + /*
> + * Cannot pad if resulting length would exceed SIZE_MAX. Returning an error
> + * to the guest is not ideal, but there is little else we can do. At least
> + * this will practically never happen on 64-bit systems.
> + */
> + if (SIZE_MAX - pad->head < bytes ||
> + SIZE_MAX - pad->head - bytes < pad->tail)
> + {
> + return -EINVAL;
> + }
> +
> + /* Length of the resulting IOV if we just concatenated everything */
> + padded_niov = !!pad->head + niov + !!pad->tail;
> +
> + qemu_iovec_init(&pad->local_qiov, MIN(padded_niov, IOV_MAX));
> +
> + if (pad->head) {
> + qemu_iovec_add(&pad->local_qiov, pad->buf, pad->head);
> + }
> +
> + /*
> + * If padded_niov > IOV_MAX, we cannot just concatenate everything.
> + * Instead, merge the first couple of elements of @iov to reduce the number
maybe, "first two-three elements"
> + * of vector elements as necessary.
> + */
> + if (padded_niov > IOV_MAX) {
>
[..]
> @@ -1653,8 +1786,8 @@ int coroutine_fn bdrv_co_preadv_part(BdrvChild *child,
> flags |= BDRV_REQ_COPY_ON_READ;
> }
>
> - ret = bdrv_pad_request(bs, &qiov, &qiov_offset, &offset, &bytes, &pad,
> - NULL, &flags);
> + ret = bdrv_pad_request(bs, &qiov, &qiov_offset, &offset, &bytes, false,
> + &pad, NULL, &flags);
> if (ret < 0) {
> goto fail;
> }
a bit later:
tracked_request_end(&req);
bdrv_padding_destroy(&pad);
Now, the request is formally finished inside bdrv_padding_destroy().. Not sure, does it really violate something, but seems safer to swap these two calls. With that:
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
PS, I feel here still exists small space for optimization:
move the logic to bdrv_init_padding(), and
1. allocate only one buffer
2. make the new collpase are to be attached to head or tail padding
3. avoid creating extra iov-slice, maybe with help of some new qemu_iovec_* API that can control number of copied/to-be-copied iovs and/or calculation number of iovs in qiov/qiov_offset/bytes slice
--
Best regards,
Vladimir
* Re: [PATCH 3/4] util/iov: Remove qemu_iovec_init_extended()
2023-03-17 17:50 ` [PATCH 3/4] util/iov: Remove qemu_iovec_init_extended() Hanna Czenczek
2023-03-18 12:22 ` Eric Blake
@ 2023-03-20 11:59 ` Vladimir Sementsov-Ogievskiy
From: Vladimir Sementsov-Ogievskiy @ 2023-03-20 11:59 UTC (permalink / raw)
To: Hanna Czenczek, qemu-block
Cc: qemu-devel, Eric Blake, Kevin Wolf, Stefan Hajnoczi, Fam Zheng
On 17.03.23 20:50, Hanna Czenczek wrote:
> bdrv_pad_request() was the main user of qemu_iovec_init_extended().
> HEAD^ has removed that use, so we can remove qemu_iovec_init_extended()
> now.
>
> The only remaining user is qemu_iovec_init_slice(), which can easily
> inline the small part it really needs.
>
> Note that qemu_iovec_init_extended() offered a memcpy() optimization to
> initialize the new I/O vector. qemu_iovec_concat_iov(), which is used
> to replace its functionality, does not, but calls qemu_iovec_add() for
> every single element. If we decide this optimization was important, we
> will need to re-implement it in qemu_iovec_concat_iov(), which might
> also benefit its pre-existing users.
>
> Signed-off-by: Hanna Czenczek<hreitz@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
--
Best regards,
Vladimir
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 4/4] iotests/iov-padding: New test
2023-03-17 17:50 ` [PATCH 4/4] iotests/iov-padding: New test Hanna Czenczek
2023-03-18 12:39 ` Eric Blake
@ 2023-03-20 12:12 ` Vladimir Sementsov-Ogievskiy
From: Vladimir Sementsov-Ogievskiy @ 2023-03-20 12:12 UTC (permalink / raw)
To: Hanna Czenczek, qemu-block
Cc: qemu-devel, Eric Blake, Kevin Wolf, Stefan Hajnoczi, Fam Zheng
On 17.03.23 20:50, Hanna Czenczek wrote:
> Test that even vectored IO requests with 1024 vector elements that are
> not aligned to the device's request alignment will succeed.
>
> Signed-off-by: Hanna Czenczek<hreitz@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
--
Best regards,
Vladimir
* Re: [PATCH 2/4] block: Split padded I/O vectors exceeding IOV_MAX
2023-03-20 10:31 ` Vladimir Sementsov-Ogievskiy
@ 2023-04-03 13:33 ` Hanna Czenczek
2023-04-04 8:10 ` Vladimir Sementsov-Ogievskiy
From: Hanna Czenczek @ 2023-04-03 13:33 UTC (permalink / raw)
To: Vladimir Sementsov-Ogievskiy, qemu-block
Cc: qemu-devel, Eric Blake, Kevin Wolf, Stefan Hajnoczi, Fam Zheng
(Sorry for the rather late reply... Thanks for the review!)
On 20.03.23 11:31, Vladimir Sementsov-Ogievskiy wrote:
> On 17.03.23 20:50, Hanna Czenczek wrote:
[...]
>> diff --git a/block/io.c b/block/io.c
>> index 8974d46941..1e9cdba17a 100644
>> --- a/block/io.c
>> +++ b/block/io.c
>
> [..]
>
>> + pad->write = write;
>> +
>> return true;
>> }
>> @@ -1545,6 +1561,18 @@ zero_mem:
>> static void bdrv_padding_destroy(BdrvRequestPadding *pad)
>
> Maybe, rename to _finalize, to stress that it's not only freeing memory.
Sounds good!
[...]
>> @@ -1552,6 +1580,101 @@ static void
>> bdrv_padding_destroy(BdrvRequestPadding *pad)
>> memset(pad, 0, sizeof(*pad));
>> }
>> +/*
>> + * Create pad->local_qiov by wrapping @iov in the padding head and
>> tail, while
>> + * ensuring that the resulting vector will not exceed IOV_MAX elements.
>> + *
>> + * To ensure this, when necessary, the first couple of elements (up
>> to three)
>
> maybe, "first two-three elements"
Sure (here and...
[...]
>> + /*
>> + * If padded_niov > IOV_MAX, we cannot just concatenate everything.
>> + * Instead, merge the first couple of elements of @iov to reduce
>> the number
>
> maybe, "first two-three elements"
...here).
>
>> + * of vector elements as necessary.
>> + */
>> + if (padded_niov > IOV_MAX) {
>>
>
> [..]
>
>> @@ -1653,8 +1786,8 @@ int coroutine_fn bdrv_co_preadv_part(BdrvChild
>> *child,
>> flags |= BDRV_REQ_COPY_ON_READ;
>> }
>> - ret = bdrv_pad_request(bs, &qiov, &qiov_offset, &offset,
>> &bytes, &pad,
>> - NULL, &flags);
>> + ret = bdrv_pad_request(bs, &qiov, &qiov_offset, &offset, &bytes,
>> false,
>> + &pad, NULL, &flags);
>> if (ret < 0) {
>> goto fail;
>> }
>
> a bit later:
>
> tracked_request_end(&req);
> bdrv_padding_destroy(&pad);
>
>
> Now, the request is formally finished inside bdrv_padding_destroy()..
> Not sure, does it really violate something, but seems safer to swap
> these two calls.
I’d rather not, for two reasons: First, tracked requests are (as far as
I understand) only there to implement request serialization, and so only
care about metadata (offset, length, and type), which is not changed by
changes to the I/O vector.
Second, even if the state of the I/O vector were relevant to tracked
requests, I think it would actually be the other way around, i.e. the
tracked request must be ended before the padding is
finalized/destroyed. The tracked request is about the actual request we
submit to `child` (which is why tracked_request_begin() is called after
bdrv_pad_request()), and that request is done using the modified I/O
vector. So if the tracked request had any connection to the request’s
I/O vector (which it doesn’t), it would be to this modified one, so we
mustn’t invalidate it via bdrv_padding_finalize() while the tracked
request lives.
Or, said differently: I generally try to clean up things in the inverse
way they were set up, and because bdrv_pad_requests() comes before
tracked_request_begin(), I think tracked_request_end() should come
before bdrv_padding_finalize().
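(Schematically, the call order in bdrv_co_preadv_part() with this series
applied; a simplified sketch from memory, error handling omitted:

    bdrv_pad_request(bs, &qiov, &qiov_offset, &offset, &bytes, false,
                     &pad, NULL, &flags);              /* (1) set up padding */
    tracked_request_begin(&req, bs, offset, bytes,
                          BDRV_TRACKED_READ);          /* (2) begin tracking */
    ret = bdrv_aligned_preadv(child, &req, offset, bytes,
                              bs->bl.request_alignment,
                              qiov, qiov_offset, flags); /* uses padded qiov */
    tracked_request_end(&req);                         /* (2') end tracking  */
    bdrv_padding_finalize(&pad);                       /* (1') undo padding  */

so teardown mirrors setup in reverse.)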
> With that:
>
> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
>
>
>
> PS, I feel here still exists small space for optimization:
The question is whether any optimization is really worth it, and I’m not
sure it is. The bug has been in qemu for over two years, and because
the only report I’ve seen about it came from our QE department, it seems
like a very rare case, so I find it more important for the code to be as
simple as possible than to optimize.
> move the logic to bdrv_init_padding(), and
>
> 1. allocate only one buffer
> 2. make the new collpase are to be attached to head or tail padding
> 3. avoid creating extra iov-slice, maybe with help of some new
> qemu_iovec_* API that can control number of copied/to-be-copied iovs
> and/or calculation number of iovs in qiov/qiov_offset/bytes slice
I’ve actually begun by trying to reuse the padding buffer, and to
collapse head/tail into it, but found it to be rather complicated. See
also my reply to Stefan here:
https://lists.nongnu.org/archive/html/qemu-devel/2023-03/msg04774.html
Hanna
* Re: [PATCH 4/4] iotests/iov-padding: New test
2023-03-18 12:39 ` Eric Blake
@ 2023-04-03 14:23 ` Hanna Czenczek
From: Hanna Czenczek @ 2023-04-03 14:23 UTC (permalink / raw)
To: Eric Blake
Cc: qemu-block, qemu-devel, Vladimir Sementsov-Ogievskiy, Kevin Wolf,
Stefan Hajnoczi, Fam Zheng
On 18.03.23 13:39, Eric Blake wrote:
> On Fri, Mar 17, 2023 at 06:50:19PM +0100, Hanna Czenczek wrote:
>> Test that even vectored IO requests with 1024 vector elements that are
>> not aligned to the device's request alignment will succeed.
>>
>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
>> ---
>> +
>> +# Four combinations:
>> +# - Offset 4096, length 1023 * 512 + 512: Fully aligned to 4k
>> +# - Offset 4096, length 1023 * 512 + 4096: Head is aligned, tail is not
> Making sure I understand - we don't care if intermediate iovs are
> smaller than the alignment requirement, because the scatter-gather
> effects will pool it together and our overall request is still
> aligned. Correct?
Not trivial to answer, but the short answer is that we don’t care.
The long answer: Padding is only done for head and tail, so intermediate
alignment doesn’t matter – for the interface. Internally, it does
actually matter, because every single buffer in the vector must be
aligned to the memory alignment, and its length must be aligned to the
request alignment (bdrv_qiov_is_aligned()). If those requirements aren’t
met, the file-posix driver will mark the request as QEMU_AIO_MISALIGNED,
and emulate it (with many separate writes) instead of handing it over to
a kernel interface that would take the full vector.
That doesn’t really matter for the test, but explains why this was no
issue before we decided to actually limit I/O vectors internally to
1024 elements: If the guest sees a smaller request alignment than the
host has, and then submits a 1024-element vectored request, there is for
sure going to be some buffer in those 1024 elements that isn’t fully
aligned. Therefore, qemu will emulate the request, and so it doesn’t
matter whether we increased its length past 1024 elements.
So what it does mean for the test is that these requests’ I/O vectors
are never going to be passed to the kernel as-is, so that code path
isn’t exercised. But, as laid out in the previous paragraph, that code
path isn’t exercised with guest OSs either.
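(A sketch of the kind of per-element check meant here; this assumes the
shape of the check rather than quoting the file-posix code verbatim:

    static bool qiov_fully_aligned(BlockDriverState *bs, QEMUIOVector *qiov)
    {
        for (int i = 0; i < qiov->niov; i++) {
            /* every buffer address must obey the memory alignment... */
            if ((uintptr_t)qiov->iov[i].iov_base % bs->bl.min_mem_alignment) {
                return false;
            }
            /* ...and every buffer length the request alignment */
            if (qiov->iov[i].iov_len % bs->bl.request_alignment) {
                return false;
            }
        }
        return true;
    }

With 1024 elements of guest-chosen sizes, at least one element will
practically always fail such a check, so the request gets emulated.)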
> Reviewed-by: Eric Blake <eblake@redhat.com>
Thanks!
* Re: [PATCH 2/4] block: Split padded I/O vectors exceeding IOV_MAX
2023-04-03 13:33 ` Hanna Czenczek
@ 2023-04-04 8:10 ` Vladimir Sementsov-Ogievskiy
2023-04-04 17:32 ` Hanna Czenczek
From: Vladimir Sementsov-Ogievskiy @ 2023-04-04 8:10 UTC (permalink / raw)
To: Hanna Czenczek, qemu-block
Cc: qemu-devel, Eric Blake, Kevin Wolf, Stefan Hajnoczi, Fam Zheng
On 03.04.23 16:33, Hanna Czenczek wrote:
> (Sorry for the rather late reply... Thanks for the review!)
>
> On 20.03.23 11:31, Vladimir Sementsov-Ogievskiy wrote:
>> On 17.03.23 20:50, Hanna Czenczek wrote:
>
> [...]
>
>>> diff --git a/block/io.c b/block/io.c
>>> index 8974d46941..1e9cdba17a 100644
>>> --- a/block/io.c
>>> +++ b/block/io.c
>>
>> [..]
>>
>>> + pad->write = write;
>>> +
>>> return true;
>>> }
>>> @@ -1545,6 +1561,18 @@ zero_mem:
>>> static void bdrv_padding_destroy(BdrvRequestPadding *pad)
>>
>> Maybe, rename to _finalize, to stress that it's not only freeing memory.
>
> Sounds good!
>
> [...]
>
>>> @@ -1552,6 +1580,101 @@ static void bdrv_padding_destroy(BdrvRequestPadding *pad)
>>> memset(pad, 0, sizeof(*pad));
>>> }
>>> +/*
>>> + * Create pad->local_qiov by wrapping @iov in the padding head and tail, while
>>> + * ensuring that the resulting vector will not exceed IOV_MAX elements.
>>> + *
>>> + * To ensure this, when necessary, the first couple of elements (up to three)
>>
>> maybe, "first two-three elements"
>
> Sure (here and...
>
> [...]
>
>>> + /*
>>> + * If padded_niov > IOV_MAX, we cannot just concatenate everything.
>>> + * Instead, merge the first couple of elements of @iov to reduce the number
>>
>> maybe, "first two-three elements"
>
> ...here).
>
>>
>>> + * of vector elements as necessary.
>>> + */
>>> + if (padded_niov > IOV_MAX) {
>>>
>>
>> [..]
>>
>>> @@ -1653,8 +1786,8 @@ int coroutine_fn bdrv_co_preadv_part(BdrvChild *child,
>>> flags |= BDRV_REQ_COPY_ON_READ;
>>> }
>>> - ret = bdrv_pad_request(bs, &qiov, &qiov_offset, &offset, &bytes, &pad,
>>> - NULL, &flags);
>>> + ret = bdrv_pad_request(bs, &qiov, &qiov_offset, &offset, &bytes, false,
>>> + &pad, NULL, &flags);
>>> if (ret < 0) {
>>> goto fail;
>>> }
>>
>> a bit later:
>>
>> tracked_request_end(&req);
>> bdrv_padding_destroy(&pad);
>>
>>
>> Now, the request is formally finished inside bdrv_padding_destroy().. Not sure, does it really violate something, but seems safer to swap these two calls.
>
> I’d rather not, for two reasons: First, tracked requests are (as far as I understand) only there to implement request serialization, and so only care about metadata (offset, length, and type), which is not changed by changes to the I/O vector.
>
> Second, even if the state of the I/O vector were relevant to tracked requests, I think it would actually be the other way around, i.e. the tracked request must be ended before the padding is finalized/destroyed. The tracked request is about the actual request we submit to `child` (which is why tracked_request_begin() is called after bdrv_pad_request()), and that request is done using the modified I/O vector. So if the tracked request had any connection to the request’s I/O vector (which it doesn’t), it would be to this modified one, so we mustn’t invalidate it via bdrv_padding_finalize() while the tracked request lives.
>
> Or, said differently: I generally try to clean up things in the inverse way they were set up, and because bdrv_pad_requests() comes before tracked_request_begin(), I think tracked_request_end() should come before bdrv_padding_finalize().
Note that it's vice versa in bdrv_co_pwritev_part().
For me it's just simpler to think that the whole request, including filling the user-given qiov with data on the read path, happens inside tracked_request_begin() / tracked_request_end(). And moving the last manipulation of the qiov out of it breaks this simple model.
The guest should not care about it, as it doesn't know about request tracking. But what about internal code? Some code may depend on certain requests being finished after a bdrv_drained_begin() call, but now they may not be fully finished, and some data may not yet have been copied back to the original qiov.
I agree with your point about the sequence of object finalization, but maybe that just shows that copying data back to the qiov should not be part of bdrv_padding_finalize(), but instead a separate function, called before tracked_request_end().
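Perhaps something like this rough sketch (bdrv_padding_copy_back() is an invented name here, just to illustrate the split):

    bdrv_padding_copy_back(&pad);  /* hypothetical helper: copy read data
                                    * back into the user's qiov */
    tracked_request_end(&req);     /* the request is formally finished here */
    bdrv_padding_finalize(&pad);   /* free padding buffers and local_qiov */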
>
>> With that:
>>
>> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
>>
>>
>>
>> PS: I feel there is still some room for optimization here:
>
> The question is whether any optimization is really worth it, and I’m not sure it is. The bug has been in qemu for over two years, and because the only report I’ve seen about it came from our QE department, it seems like a very rare case, so I find it more important for the code to be as simple as possible than to optimize.
>
>> move the logic to bdrv_init_padding(), and
>>
>> 1. allocate only one buffer
>> 2. make the new collapse area be attached to the head or tail padding
>> 3. avoid creating an extra iov slice, maybe with the help of some new qemu_iovec_* API that can control the number of copied/to-be-copied iovs and/or calculate the number of iovs in a qiov/qiov_offset/bytes slice
>
> I actually began by trying to reuse the padding buffer and to collapse head/tail into it, but found it rather complicated. See also my reply to Stefan here: https://lists.nongnu.org/archive/html/qemu-devel/2023-03/msg04774.html
>
> Hanna
>
--
Best regards,
Vladimir
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 2/4] block: Split padded I/O vectors exceeding IOV_MAX
2023-04-04 8:10 ` Vladimir Sementsov-Ogievskiy
@ 2023-04-04 17:32 ` Hanna Czenczek
2023-04-05 9:59 ` Vladimir Sementsov-Ogievskiy
0 siblings, 1 reply; 21+ messages in thread
From: Hanna Czenczek @ 2023-04-04 17:32 UTC (permalink / raw)
To: Vladimir Sementsov-Ogievskiy, qemu-block
Cc: qemu-devel, Eric Blake, Kevin Wolf, Stefan Hajnoczi, Fam Zheng
On 04.04.23 10:10, Vladimir Sementsov-Ogievskiy wrote:
> On 03.04.23 16:33, Hanna Czenczek wrote:
>> (Sorry for the rather late reply... Thanks for the review!)
>>
>> On 20.03.23 11:31, Vladimir Sementsov-Ogievskiy wrote:
>>> On 17.03.23 20:50, Hanna Czenczek wrote:
>>
>> [...]
>>
>>>> diff --git a/block/io.c b/block/io.c
>>>> index 8974d46941..1e9cdba17a 100644
>>>> --- a/block/io.c
>>>> +++ b/block/io.c
>>>
>>> [..]
>>>
>>>> + pad->write = write;
>>>> +
>>>> return true;
>>>> }
>>>> @@ -1545,6 +1561,18 @@ zero_mem:
>>>> static void bdrv_padding_destroy(BdrvRequestPadding *pad)
>>>
>>> Maybe, rename to _finalize, to stress that it's not only freeing
>>> memory.
>>
>> Sounds good!
>>
>> [...]
>>
>>>> @@ -1552,6 +1580,101 @@ static void
>>>> bdrv_padding_destroy(BdrvRequestPadding *pad)
>>>> memset(pad, 0, sizeof(*pad));
>>>> }
>>>> +/*
>>>> + * Create pad->local_qiov by wrapping @iov in the padding head and
>>>> tail, while
>>>> + * ensuring that the resulting vector will not exceed IOV_MAX
>>>> elements.
>>>> + *
>>>> + * To ensure this, when necessary, the first couple of elements
>>>> (up to three)
>>>
>>> maybe, "first two-three elements"
>>
>> Sure (here and...
>>
>> [...]
>>
>>>> + /*
>>>> + * If padded_niov > IOV_MAX, we cannot just concatenate
>>>> everything.
>>>> + * Instead, merge the first couple of elements of @iov to
>>>> reduce the number
>>>
>>> maybe, "first two-three elements"
>>
>> ...here).
>>
>>>
>>>> + * of vector elements as necessary.
>>>> + */
>>>> + if (padded_niov > IOV_MAX) {
>>>>
>>>
>>> [..]
>>>
>>>> @@ -1653,8 +1786,8 @@ int coroutine_fn
>>>> bdrv_co_preadv_part(BdrvChild *child,
>>>> flags |= BDRV_REQ_COPY_ON_READ;
>>>> }
>>>> - ret = bdrv_pad_request(bs, &qiov, &qiov_offset, &offset,
>>>> &bytes, &pad,
>>>> - NULL, &flags);
>>>> + ret = bdrv_pad_request(bs, &qiov, &qiov_offset, &offset,
>>>> &bytes, false,
>>>> + &pad, NULL, &flags);
>>>> if (ret < 0) {
>>>> goto fail;
>>>> }
>>>
>>> a bit later:
>>>
>>> tracked_request_end(&req);
>>> bdrv_padding_destroy(&pad);
>>>
>>>
>>> Now, the request is formally finished inside
>>> bdrv_padding_destroy(). I'm not sure whether it really violates
>>> anything, but it seems safer to swap these two calls.
>>
>> I’d rather not, for two reasons: First, tracked requests are (as far
>> as I understand) only there to implement request serialization, and
>> so only care about metadata (offset, length, and type), which is not
>> changed by changes to the I/O vector.
>>
>> Second, even if the state of the I/O vector were relevant to tracked
>> requests, I think it would actually be the other way around, i.e. the
>> tracked request must be ended before the padding is
>> finalized/destroyed. The tracked request is about the actual request
>> we submit to `child` (which is why tracked_request_begin() is called
>> after bdrv_pad_request()), and that request is done using the
>> modified I/O vector. So if the tracked request had any connection to
>> the request’s I/O vector (which it doesn’t), it would be to this
>> modified one, so we mustn’t invalidate it via bdrv_padding_finalize()
>> while the tracked request lives.
>>
>> Or, said differently: I generally try to clean up things in the
>> inverse way they were set up, and because bdrv_pad_request() comes
>> before tracked_request_begin(), I think tracked_request_end() should
>> come before bdrv_padding_finalize().
>
> Note that it's vice versa in bdrv_co_pwritev_part().
Well, and it’s this way here. We agree that for clean-up, the order
doesn’t functionally matter, so either way is actually fine.
> For me it's just simpler to think that the whole request, including
> filling the user-given qiov with data on the read path, happens inside
> tracked_request_begin() / tracked_request_end().
It isn’t, though, because padding must be done before the tracked
request is created. The tracked request uses the request’s actual
offset and length, after padding, so bdrv_pad_request() must always be
done before (i.e., outside) tracked_request_begin().
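I.e., in simplified form (arguments abbreviated):

    bdrv_pad_request(bs, &qiov, &qiov_offset, &offset, &bytes, ...);
    /* offset and bytes may have grown to cover the padding */
    tracked_request_begin(&req, bs, offset, bytes, BDRV_TRACKED_READ);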
> And moving the last manipulation of the qiov out of it breaks this
> simple model.
> The guest should not care about it, as it doesn't know about request
> tracking. But what about internal code? Some code may depend on certain
> requests being finished after a bdrv_drained_begin() call, but now they
> may not be fully finished, and some data may not yet have been copied
> back to the original qiov.
>
> I agree with your point about the sequence of object finalization, but
> maybe that just shows that copying data back to the qiov should not be
> part of bdrv_padding_finalize(), but instead a separate function,
> called before tracked_request_end().
But my thought is that copying back shouldn’t be done before
tracked_request_end(), because copying back is not part of the tracked
request. What we track is the padded request, which uses a modified I/O
vector, so undoing that modification shouldn’t be done while the tracked
request lives.
I know I’m inconsistent with regards to bdrv_co_pwritev_part(), which is
because it doesn’t matter. My actual position is that tracked requests
are about metadata, so undoing/finalizing the padding (including
potentially copying data back) has nothing to do with a tracked request,
so the order of finalizing both cannot matter.
But you’re arguing for consistency, and my position on that is, if we
want consistency, I’d finalize the tracked request first, and the
padding second. This is also because tracking is done for
serialization, so we should end it as soon as possible, so that
concurrent requests are resumed quickly. (Though I’m not sure if
delaying it by a memcpy() matters for an essentially single-threaded
block layer at this time.)
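That is, if we want one consistent order, I’d pick (as a simplified
sketch):

    tracked_request_end(&req);     /* end serialization as early as possible,
                                    * letting concurrent requests resume */
    bdrv_padding_finalize(&pad);   /* then undo the padding, copying data
                                    * back into the user's qiov */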
Hanna
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 2/4] block: Split padded I/O vectors exceeding IOV_MAX
2023-04-04 17:32 ` Hanna Czenczek
@ 2023-04-05 9:59 ` Vladimir Sementsov-Ogievskiy
2023-04-06 16:51 ` Hanna Czenczek
0 siblings, 1 reply; 21+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2023-04-05 9:59 UTC (permalink / raw)
To: Hanna Czenczek, qemu-block
Cc: qemu-devel, Eric Blake, Kevin Wolf, Stefan Hajnoczi, Fam Zheng
On 04.04.23 20:32, Hanna Czenczek wrote:
> On 04.04.23 10:10, Vladimir Sementsov-Ogievskiy wrote:
>> On 03.04.23 16:33, Hanna Czenczek wrote:
>>> (Sorry for the rather late reply... Thanks for the review!)
>>>
>>> On 20.03.23 11:31, Vladimir Sementsov-Ogievskiy wrote:
>>>> On 17.03.23 20:50, Hanna Czenczek wrote:
>>>
>>> [...]
>>>
>>>>> diff --git a/block/io.c b/block/io.c
>>>>> index 8974d46941..1e9cdba17a 100644
>>>>> --- a/block/io.c
>>>>> +++ b/block/io.c
>>>>
>>>> [..]
>>>>
>>>>> + pad->write = write;
>>>>> +
>>>>> return true;
>>>>> }
>>>>> @@ -1545,6 +1561,18 @@ zero_mem:
>>>>> static void bdrv_padding_destroy(BdrvRequestPadding *pad)
>>>>
>>>> Maybe, rename to _finalize, to stress that it's not only freeing memory.
>>>
>>> Sounds good!
>>>
>>> [...]
>>>
>>>>> @@ -1552,6 +1580,101 @@ static void bdrv_padding_destroy(BdrvRequestPadding *pad)
>>>>> memset(pad, 0, sizeof(*pad));
>>>>> }
>>>>> +/*
>>>>> + * Create pad->local_qiov by wrapping @iov in the padding head and tail, while
>>>>> + * ensuring that the resulting vector will not exceed IOV_MAX elements.
>>>>> + *
>>>>> + * To ensure this, when necessary, the first couple of elements (up to three)
>>>>
>>>> maybe, "first two-three elements"
>>>
>>> Sure (here and...
>>>
>>> [...]
>>>
>>>>> + /*
>>>>> + * If padded_niov > IOV_MAX, we cannot just concatenate everything.
>>>>> + * Instead, merge the first couple of elements of @iov to reduce the number
>>>>
>>>> maybe, "first two-three elements"
>>>
>>> ...here).
>>>
>>>>
>>>>> + * of vector elements as necessary.
>>>>> + */
>>>>> + if (padded_niov > IOV_MAX) {
>>>>>
>>>>
>>>> [..]
>>>>
>>>>> @@ -1653,8 +1786,8 @@ int coroutine_fn bdrv_co_preadv_part(BdrvChild *child,
>>>>> flags |= BDRV_REQ_COPY_ON_READ;
>>>>> }
>>>>> - ret = bdrv_pad_request(bs, &qiov, &qiov_offset, &offset, &bytes, &pad,
>>>>> - NULL, &flags);
>>>>> + ret = bdrv_pad_request(bs, &qiov, &qiov_offset, &offset, &bytes, false,
>>>>> + &pad, NULL, &flags);
>>>>> if (ret < 0) {
>>>>> goto fail;
>>>>> }
>>>>
>>>> a bit later:
>>>>
>>>> tracked_request_end(&req);
>>>> bdrv_padding_destroy(&pad);
>>>>
>>>>
>>>> Now, the request is formally finished inside bdrv_padding_destroy(). I'm not sure whether it really violates anything, but it seems safer to swap these two calls.
>>>
>>> I’d rather not, for two reasons: First, tracked requests are (as far as I understand) only there to implement request serialization, and so only care about metadata (offset, length, and type), which is not changed by changes to the I/O vector.
>>>
>>> Second, even if the state of the I/O vector were relevant to tracked requests, I think it would actually be the other way around, i.e. the tracked request must be ended before the padding is finalized/destroyed. The tracked request is about the actual request we submit to `child` (which is why tracked_request_begin() is called after bdrv_pad_request()), and that request is done using the modified I/O vector. So if the tracked request had any connection to the request’s I/O vector (which it doesn’t), it would be to this modified one, so we mustn’t invalidate it via bdrv_padding_finalize() while the tracked request lives.
>>>
>>> Or, said differently: I generally try to clean up things in the inverse way they were set up, and because bdrv_pad_request() comes before tracked_request_begin(), I think tracked_request_end() should come before bdrv_padding_finalize().
>>
>> Note that it's vice versa in bdrv_co_pwritev_part().
>
> Well, and it’s this way here. We agree that for clean-up, the order doesn’t functionally matter, so either way is actually fine.
>
>> For me it's just simpler to think that the whole request, including filling the user-given qiov with data on the read path, happens inside tracked_request_begin() / tracked_request_end().
>
> It isn’t, though, because padding must be done before the tracked request is created. The tracked request uses the request’s actual offset and length, after padding, so bdrv_pad_request() must always be done before (i.e., outside) tracked_request_begin().
>
>> And moving the last manipulation of the qiov out of it breaks this simple model.
>> The guest should not care about it, as it doesn't know about request tracking. But what about internal code? Some code may depend on certain requests being finished after a bdrv_drained_begin() call, but now they may not be fully finished, and some data may not yet have been copied back to the original qiov.
You didn't answer here. Do you think that's a wrong assumption for users of drained sections?
>>
>> I agree with your point about the sequence of object finalization, but maybe that just shows that copying data back to the qiov should not be part of bdrv_padding_finalize(), but instead a separate function, called before tracked_request_end().
>
> But my thought is that copying back shouldn’t be done before tracked_request_end(), because copying back is not part of the tracked request. What we track is the padded request, which uses a modified I/O vector, so undoing that modification shouldn’t be done while the tracked request lives.
>
> I know I’m inconsistent with regards to bdrv_co_pwritev_part(), which is because it doesn’t matter. My actual position is that tracked requests are about metadata, so undoing/finalizing the padding (including potentially copying data back) has nothing to do with a tracked request, so the order of finalizing both cannot matter.
>
> But you’re arguing for consistency, and my position on that is, if we want consistency, I’d finalize the tracked request first, and the padding second. This is also because tracking is done for serialization, so we should end it as soon as possible, so that concurrent requests are resumed quickly. (Though I’m not sure if delaying it by a memcpy() matters for an essentially single-threaded block layer at this time.)
>
> Hanna
>
--
Best regards,
Vladimir
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 0/4] block: Split padded I/O vectors exceeding IOV_MAX
2023-03-17 17:50 [PATCH 0/4] block: Split padded I/O vectors exceeding IOV_MAX Hanna Czenczek
` (3 preceding siblings ...)
2023-03-17 17:50 ` [PATCH 4/4] iotests/iov-padding: New test Hanna Czenczek
@ 2023-04-05 19:44 ` Stefan Hajnoczi
4 siblings, 0 replies; 21+ messages in thread
From: Stefan Hajnoczi @ 2023-04-05 19:44 UTC (permalink / raw)
To: Hanna Czenczek
Cc: qemu-block, qemu-devel, Vladimir Sementsov-Ogievskiy, Eric Blake,
Kevin Wolf, Fam Zheng
On Fri, Mar 17, 2023 at 06:50:15PM +0100, Hanna Czenczek wrote:
> RFC:
> https://lists.nongnu.org/archive/html/qemu-block/2023-03/msg00446.html
>
> Thanks for the feedback on the RFC! Sounds like we agree that this is
> the right way to fix the bug.
>
> Here in v1, I’ve followed Vladimir’s suggestion to inline the
> functionality of qemu_iovec_init_extended() in block/io.c, which, I
> think (hope), also addresses much of the feedback of Eric and Stefan.
>
> The test is unchanged, the rest is pretty much rewritten (though in
> spirit stays the same).
>
>
> Hanna Czenczek (4):
> util/iov: Make qiov_slice() public
> block: Split padded I/O vectors exceeding IOV_MAX
> util/iov: Remove qemu_iovec_init_extended()
> iotests/iov-padding: New test
>
> include/qemu/iov.h | 8 +-
> block/io.c | 153 +++++++++++++++++++++--
> util/iov.c | 89 +++----------
> tests/qemu-iotests/tests/iov-padding | 85 +++++++++++++
> tests/qemu-iotests/tests/iov-padding.out | 59 +++++++++
> 5 files changed, 306 insertions(+), 88 deletions(-)
> create mode 100755 tests/qemu-iotests/tests/iov-padding
> create mode 100644 tests/qemu-iotests/tests/iov-padding.out
Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 2/4] block: Split padded I/O vectors exceeding IOV_MAX
2023-04-05 9:59 ` Vladimir Sementsov-Ogievskiy
@ 2023-04-06 16:51 ` Hanna Czenczek
2023-04-06 21:35 ` Vladimir Sementsov-Ogievskiy
0 siblings, 1 reply; 21+ messages in thread
From: Hanna Czenczek @ 2023-04-06 16:51 UTC (permalink / raw)
To: Vladimir Sementsov-Ogievskiy, qemu-block
Cc: qemu-devel, Eric Blake, Kevin Wolf, Stefan Hajnoczi, Fam Zheng
On 05.04.23 11:59, Vladimir Sementsov-Ogievskiy wrote:
> On 04.04.23 20:32, Hanna Czenczek wrote:
>> On 04.04.23 10:10, Vladimir Sementsov-Ogievskiy wrote:
>>> On 03.04.23 16:33, Hanna Czenczek wrote:
>>>> (Sorry for the rather late reply... Thanks for the review!)
>>>>
>>>> On 20.03.23 11:31, Vladimir Sementsov-Ogievskiy wrote:
>>>>> On 17.03.23 20:50, Hanna Czenczek wrote:
>>>>
>>>> [...]
>>>>
>>>>>> diff --git a/block/io.c b/block/io.c
>>>>>> index 8974d46941..1e9cdba17a 100644
>>>>>> --- a/block/io.c
>>>>>> +++ b/block/io.c
>>>>>
>>>>> [..]
>>>>>
>>>>>> + pad->write = write;
>>>>>> +
>>>>>> return true;
>>>>>> }
>>>>>> @@ -1545,6 +1561,18 @@ zero_mem:
>>>>>> static void bdrv_padding_destroy(BdrvRequestPadding *pad)
>>>>>
>>>>> Maybe, rename to _finalize, to stress that it's not only freeing
>>>>> memory.
>>>>
>>>> Sounds good!
>>>>
>>>> [...]
>>>>
>>>>>> @@ -1552,6 +1580,101 @@ static void
>>>>>> bdrv_padding_destroy(BdrvRequestPadding *pad)
>>>>>> memset(pad, 0, sizeof(*pad));
>>>>>> }
>>>>>> +/*
>>>>>> + * Create pad->local_qiov by wrapping @iov in the padding head
>>>>>> and tail, while
>>>>>> + * ensuring that the resulting vector will not exceed IOV_MAX
>>>>>> elements.
>>>>>> + *
>>>>>> + * To ensure this, when necessary, the first couple of elements
>>>>>> (up to three)
>>>>>
>>>>> maybe, "first two-three elements"
>>>>
>>>> Sure (here and...
>>>>
>>>> [...]
>>>>
>>>>>> + /*
>>>>>> + * If padded_niov > IOV_MAX, we cannot just concatenate
>>>>>> everything.
>>>>>> + * Instead, merge the first couple of elements of @iov to
>>>>>> reduce the number
>>>>>
>>>>> maybe, "first two-three elements"
>>>>
>>>> ...here).
>>>>
>>>>>
>>>>>> + * of vector elements as necessary.
>>>>>> + */
>>>>>> + if (padded_niov > IOV_MAX) {
>>>>>>
>>>>>
>>>>> [..]
>>>>>
>>>>>> @@ -1653,8 +1786,8 @@ int coroutine_fn
>>>>>> bdrv_co_preadv_part(BdrvChild *child,
>>>>>> flags |= BDRV_REQ_COPY_ON_READ;
>>>>>> }
>>>>>> - ret = bdrv_pad_request(bs, &qiov, &qiov_offset, &offset,
>>>>>> &bytes, &pad,
>>>>>> - NULL, &flags);
>>>>>> + ret = bdrv_pad_request(bs, &qiov, &qiov_offset, &offset,
>>>>>> &bytes, false,
>>>>>> + &pad, NULL, &flags);
>>>>>> if (ret < 0) {
>>>>>> goto fail;
>>>>>> }
>>>>>
>>>>> a bit later:
>>>>>
>>>>> tracked_request_end(&req);
>>>>> bdrv_padding_destroy(&pad);
>>>>>
>>>>>
>>>>> Now, the request is formally finished inside
>>>>> bdrv_padding_destroy(). I'm not sure whether it really violates
>>>>> anything, but it seems safer to swap these two calls.
>>>>
>>>> I’d rather not, for two reasons: First, tracked requests are (as
>>>> far as I understand) only there to implement request serialization,
>>>> and so only care about metadata (offset, length, and type), which
>>>> is not changed by changes to the I/O vector.
>>>>
>>>> Second, even if the state of the I/O vector were relevant to
>>>> tracked requests, I think it would actually be the other way
>>>> around, i.e. the tracked request must be ended before the padding
>>>> is finalized/destroyed. The tracked request is about the actual
>>>> request we submit to `child` (which is why tracked_request_begin()
>>>> is called after bdrv_pad_request()), and that request is done using
>>>> the modified I/O vector. So if the tracked request had any
>>>> connection to the request’s I/O vector (which it doesn’t), it would
>>>> be to this modified one, so we mustn’t invalidate it via
>>>> bdrv_padding_finalize() while the tracked request lives.
>>>>
>>>> Or, said differently: I generally try to clean up things in the
>>>> inverse way they were set up, and because bdrv_pad_request() comes
>>>> before tracked_request_begin(), I think tracked_request_end()
>>>> should come before bdrv_padding_finalize().
>>>
>>> Note that it's vice versa in bdrv_co_pwritev_part().
>>
>> Well, and it’s this way here. We agree that for clean-up, the order
>> doesn’t functionally matter, so either way is actually fine.
>>
>>> For me it's just simpler to think that the whole request, including
>>> filling the user-given qiov with data on the read path, happens
>>> inside tracked_request_begin() / tracked_request_end().
>>
>> It isn’t, though, because padding must be done before the tracked
>> request is created. The tracked request uses the request’s actual
>> offset and length, after padding, so bdrv_pad_request() must always
>> be done before (i.e., outside) tracked_request_begin().
>>
>>> And moving the last manipulation of the qiov out of it breaks this
>>> simple model.
>>> The guest should not care about it, as it doesn't know about request
>>> tracking. But what about internal code? Some code may depend on
>>> certain requests being finished after a bdrv_drained_begin() call,
>>> but now they may not be fully finished, and some data may not yet
>>> have been copied back to the original qiov.
>
> You didn't answer here. Do you think that's a wrong assumption for
> users of drained sections?
Tracked requests are about request (write) serialization; they have
nothing to do with draining.  Draining is about waiting until the
in_flight counter is 0, i.e. waiting for bdrv_dec_in_flight(), which is
separate from tracked_request_end() and always comes after
bdrv_padding_finalize().
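As a simplified sketch of bdrv_co_preadv_part() (error paths elided):

    bdrv_inc_in_flight(bs);            /* drained sections wait on this counter */
    bdrv_pad_request(...);
    tracked_request_begin(&req, ...);  /* serialization starts */
    ret = bdrv_aligned_preadv(...);
    tracked_request_end(&req);         /* serialization ends */
    bdrv_padding_finalize(&pad);       /* data is copied back here */
    bdrv_dec_in_flight(bs);            /* only now is the request done as far
                                        * as draining is concerned */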
Hanna
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 2/4] block: Split padded I/O vectors exceeding IOV_MAX
2023-04-06 16:51 ` Hanna Czenczek
@ 2023-04-06 21:35 ` Vladimir Sementsov-Ogievskiy
0 siblings, 0 replies; 21+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2023-04-06 21:35 UTC (permalink / raw)
To: Hanna Czenczek, qemu-block
Cc: qemu-devel, Eric Blake, Kevin Wolf, Stefan Hajnoczi, Fam Zheng
On 06.04.23 19:51, Hanna Czenczek wrote:
> On 05.04.23 11:59, Vladimir Sementsov-Ogievskiy wrote:
>> On 04.04.23 20:32, Hanna Czenczek wrote:
>>> On 04.04.23 10:10, Vladimir Sementsov-Ogievskiy wrote:
>>>> On 03.04.23 16:33, Hanna Czenczek wrote:
>>>>> (Sorry for the rather late reply... Thanks for the review!)
>>>>>
>>>>> On 20.03.23 11:31, Vladimir Sementsov-Ogievskiy wrote:
>>>>>> On 17.03.23 20:50, Hanna Czenczek wrote:
>>>>>
>>>>> [...]
>>>>>
>>>>>>> diff --git a/block/io.c b/block/io.c
>>>>>>> index 8974d46941..1e9cdba17a 100644
>>>>>>> --- a/block/io.c
>>>>>>> +++ b/block/io.c
>>>>>>
>>>>>> [..]
>>>>>>
>>>>>>> + pad->write = write;
>>>>>>> +
>>>>>>> return true;
>>>>>>> }
>>>>>>> @@ -1545,6 +1561,18 @@ zero_mem:
>>>>>>> static void bdrv_padding_destroy(BdrvRequestPadding *pad)
>>>>>>
>>>>>> Maybe, rename to _finalize, to stress that it's not only freeing memory.
>>>>>
>>>>> Sounds good!
>>>>>
>>>>> [...]
>>>>>
>>>>>>> @@ -1552,6 +1580,101 @@ static void bdrv_padding_destroy(BdrvRequestPadding *pad)
>>>>>>> memset(pad, 0, sizeof(*pad));
>>>>>>> }
>>>>>>> +/*
>>>>>>> + * Create pad->local_qiov by wrapping @iov in the padding head and tail, while
>>>>>>> + * ensuring that the resulting vector will not exceed IOV_MAX elements.
>>>>>>> + *
>>>>>>> + * To ensure this, when necessary, the first couple of elements (up to three)
>>>>>>
>>>>>> maybe, "first two-three elements"
>>>>>
>>>>> Sure (here and...
>>>>>
>>>>> [...]
>>>>>
>>>>>>> + /*
>>>>>>> + * If padded_niov > IOV_MAX, we cannot just concatenate everything.
>>>>>>> + * Instead, merge the first couple of elements of @iov to reduce the number
>>>>>>
>>>>>> maybe, "first two-three elements"
>>>>>
>>>>> ...here).
>>>>>
>>>>>>
>>>>>>> + * of vector elements as necessary.
>>>>>>> + */
>>>>>>> + if (padded_niov > IOV_MAX) {
>>>>>>>
>>>>>>
>>>>>> [..]
>>>>>>
>>>>>>> @@ -1653,8 +1786,8 @@ int coroutine_fn bdrv_co_preadv_part(BdrvChild *child,
>>>>>>> flags |= BDRV_REQ_COPY_ON_READ;
>>>>>>> }
>>>>>>> - ret = bdrv_pad_request(bs, &qiov, &qiov_offset, &offset, &bytes, &pad,
>>>>>>> - NULL, &flags);
>>>>>>> + ret = bdrv_pad_request(bs, &qiov, &qiov_offset, &offset, &bytes, false,
>>>>>>> + &pad, NULL, &flags);
>>>>>>> if (ret < 0) {
>>>>>>> goto fail;
>>>>>>> }
>>>>>>
>>>>>> a bit later:
>>>>>>
>>>>>> tracked_request_end(&req);
>>>>>> bdrv_padding_destroy(&pad);
>>>>>>
>>>>>>
>>>>>> Now, the request is formally finished inside bdrv_padding_destroy(). I'm not sure whether it really violates anything, but it seems safer to swap these two calls.
>>>>>
>>>>> I’d rather not, for two reasons: First, tracked requests are (as far as I understand) only there to implement request serialization, and so only care about metadata (offset, length, and type), which is not changed by changes to the I/O vector.
>>>>>
>>>>> Second, even if the state of the I/O vector were relevant to tracked requests, I think it would actually be the other way around, i.e. the tracked request must be ended before the padding is finalized/destroyed. The tracked request is about the actual request we submit to `child` (which is why tracked_request_begin() is called after bdrv_pad_request()), and that request is done using the modified I/O vector. So if the tracked request had any connection to the request’s I/O vector (which it doesn’t), it would be to this modified one, so we mustn’t invalidate it via bdrv_padding_finalize() while the tracked request lives.
>>>>>
>>>>> Or, said differently: I generally try to clean up things in the inverse way they were set up, and because bdrv_pad_request() comes before tracked_request_begin(), I think tracked_request_end() should come before bdrv_padding_finalize().
>>>>
>>>> Note that it's vice versa in bdrv_co_pwritev_part().
>>>
>>> Well, and it’s this way here. We agree that for clean-up, the order doesn’t functionally matter, so either way is actually fine.
>>>
>>>> For me it's just simpler to think that the whole request, including filling the user-given qiov with data on the read path, happens inside tracked_request_begin() / tracked_request_end().
>>>
>>> It isn’t, though, because padding must be done before the tracked request is created. The tracked request uses the request’s actual offset and length, after padding, so bdrv_pad_request() must always be done before (i.e., outside) tracked_request_begin().
>>>
>>>> And moving the last manipulation of the qiov out of it breaks this simple model.
>>>> The guest should not care about it, as it doesn't know about request tracking. But what about internal code? Some code may depend on certain requests being finished after a bdrv_drained_begin() call, but now they may not be fully finished, and some data may not yet have been copied back to the original qiov.
>>
>> You didn't answer here. Do you think that's a wrong assumption for users of drained sections?
>
> Tracked requests are about request (write) serialization; they have nothing to do with draining. Draining is about waiting until the in_flight counter is 0, i.e. waiting for bdrv_dec_in_flight(), which is separate from tracked_request_end() and always comes after bdrv_padding_finalize().
>
Oh, right, I was wrong; sorry for arguing for so long.
No more objections:
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
--
Best regards,
Vladimir
^ permalink raw reply [flat|nested] 21+ messages in thread
end of thread, other threads:[~2023-04-06 21:37 UTC | newest]
Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-03-17 17:50 [PATCH 0/4] block: Split padded I/O vectors exceeding IOV_MAX Hanna Czenczek
2023-03-17 17:50 ` [PATCH 1/4] util/iov: Make qiov_slice() public Hanna Czenczek
2023-03-18 12:06 ` Eric Blake
2023-03-20 10:00 ` Vladimir Sementsov-Ogievskiy
2023-03-17 17:50 ` [PATCH 2/4] block: Split padded I/O vectors exceeding IOV_MAX Hanna Czenczek
2023-03-18 12:19 ` Eric Blake
2023-03-20 10:31 ` Vladimir Sementsov-Ogievskiy
2023-04-03 13:33 ` Hanna Czenczek
2023-04-04 8:10 ` Vladimir Sementsov-Ogievskiy
2023-04-04 17:32 ` Hanna Czenczek
2023-04-05 9:59 ` Vladimir Sementsov-Ogievskiy
2023-04-06 16:51 ` Hanna Czenczek
2023-04-06 21:35 ` Vladimir Sementsov-Ogievskiy
2023-03-17 17:50 ` [PATCH 3/4] util/iov: Remove qemu_iovec_init_extended() Hanna Czenczek
2023-03-18 12:22 ` Eric Blake
2023-03-20 11:59 ` Vladimir Sementsov-Ogievskiy
2023-03-17 17:50 ` [PATCH 4/4] iotests/iov-padding: New test Hanna Czenczek
2023-03-18 12:39 ` Eric Blake
2023-04-03 14:23 ` Hanna Czenczek
2023-03-20 12:12 ` Vladimir Sementsov-Ogievskiy
2023-04-05 19:44 ` [PATCH 0/4] block: Split padded I/O vectors exceeding IOV_MAX Stefan Hajnoczi
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).