* [PATCH for-11.0 v2 0/3] linux-aio/io-uring: Resubmit tails of short requests
From: Hanna Czenczek @ 2026-03-24 8:43 UTC
To: qemu-block
Cc: qemu-devel, Hanna Czenczek, Kevin Wolf, Julia Suvorova,
Aarushi Mehta, Stefan Hajnoczi, Stefano Garzarella
Hi,
v1 is here:
https://lists.nongnu.org/archive/html/qemu-block/2026-03/msg00307.html
Short reads and writes can happen. One way to reproduce them is via a
FUSE export, if you force it to limit the request length in the
read/write path (see the diffs in the commit messages of patches 2 and
3); short writes specifically can apparently also happen with NFS.
For the file-posix block driver, aio=threads already takes care of them.
aio=native does not handle them at all, and aio=io_uring handles only
short reads, not short writes. This series makes both aio=native and aio=io_uring handle
both short reads and writes. zone-append is not touched, as I don’t
believe resubmitting the tail (if a short append can even happen) is
safe.
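The general idea of resubmitting the tail can be sketched synchronously as below. This is only an illustration, not code from the series: `write_full()`, `pwrite_fn`, and the `short_backend` test helper are hypothetical names, and the backend simply mimics the 4096-byte limit used in the FUSE reproducer.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical callback: behaves like pwrite() and may transfer fewer bytes. */
typedef ssize_t (*pwrite_fn)(void *opaque, const void *buf, size_t len,
                             off_t offset);

/*
 * Write all of buf, resubmitting the tail after every short write.
 * Returns 0 on success or a negative errno value.
 */
static int write_full(pwrite_fn do_write, void *opaque,
                      const void *buf, size_t len, off_t offset)
{
    size_t total_done = 0;

    while (total_done < len) {
        ssize_t ret = do_write(opaque, (const char *)buf + total_done,
                               len - total_done, offset + (off_t)total_done);
        if (ret < 0) {
            return (int)ret;        /* real error from the backend */
        }
        if (ret == 0) {
            return -ENOSPC;         /* no progress at all: treat as ENOSPC */
        }
        total_done += (size_t)ret;  /* short write: retry the tail */
    }
    return 0;
}

/* Illustration-only backend: 8 KiB in memory, at most 4096 bytes per call. */
struct short_backend {
    char data[8192];
    int calls;
};

static ssize_t short_pwrite(void *opaque, const void *buf, size_t len,
                            off_t offset)
{
    struct short_backend *b = opaque;

    if ((size_t)offset >= sizeof(b->data)) {
        return 0;                   /* end of the backing store */
    }
    if (len > 4096) {
        len = 4096;                 /* force a short write */
    }
    if (len > sizeof(b->data) - (size_t)offset) {
        len = sizeof(b->data) - (size_t)offset;
    }
    memcpy(b->data + offset, buf, len);
    b->calls++;
    return (ssize_t)len;
}
```

With this backend, an 8 KiB write needs two calls, while a write running past the end of the store makes no progress on the tail and ends in -ENOSPC.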
v2:
- Patch 1 (kept R-b):
- Put all 32-bit fields together
- Removed unnecessary parentheses
- Patch 2 (kept R-b):
- make qemu_iovec_destroy() call contingent on qiov.iov being non-NULL
- include total_done in offset in laio_do_submit()
- Patch 3 (kept R-b):
- make qemu_iovec_destroy() call contingent on qiov.iov being non-NULL
- include total_done in offset in luring_prep_sqe()
git-backport-diff against v1:
Key:
[----] : patches are identical
[####] : number of functional differences between upstream/downstream patch
[down] : patch is downstream-only
The flags [FC] indicate (F)unctional and (C)ontextual differences, respectively
001/3:[0007] [FC] 'linux-aio: Put all parameters into qemu_laiocb'
002/3:[0017] [FC] 'linux-aio: Resubmit tails of short reads/writes'
003/3:[0023] [FC] 'io-uring: Resubmit tails of short writes'
Hanna Czenczek (3):
linux-aio: Put all parameters into qemu_laiocb
linux-aio: Resubmit tails of short reads/writes
io-uring: Resubmit tails of short writes
block/io_uring.c | 82 +++++++++++++++++++++++-------------------
block/linux-aio.c | 88 +++++++++++++++++++++++++++++++++++++---------
block/trace-events | 2 +-
3 files changed, 117 insertions(+), 55 deletions(-)
--
2.53.0
* [PATCH for-11.0 v2 1/3] linux-aio: Put all parameters into qemu_laiocb
From: Hanna Czenczek @ 2026-03-24 8:43 UTC
To: qemu-block
Cc: qemu-devel, Hanna Czenczek, Kevin Wolf, Julia Suvorova,
Aarushi Mehta, Stefan Hajnoczi, Stefano Garzarella
Put all request parameters into the qemu_laiocb struct, which will allow
re-submitting the tail of short reads/writes.
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
block/linux-aio.c | 34 ++++++++++++++++++++++------------
1 file changed, 22 insertions(+), 12 deletions(-)
diff --git a/block/linux-aio.c b/block/linux-aio.c
index 53c3e9af8a..3843f45eac 100644
--- a/block/linux-aio.c
+++ b/block/linux-aio.c
@@ -41,9 +41,15 @@ struct qemu_laiocb {
LinuxAioState *ctx;
struct iocb iocb;
ssize_t ret;
+ off_t offset;
size_t nbytes;
QEMUIOVector *qiov;
- bool is_read;
+
+ int fd;
+ int type;
+ BdrvRequestFlags flags;
+
+ uint64_t dev_max_batch;
QSIMPLEQ_ENTRY(qemu_laiocb) next;
};
@@ -87,7 +93,7 @@ static void qemu_laio_process_completion(struct qemu_laiocb *laiocb)
ret = 0;
} else if (ret >= 0) {
/* Short reads mean EOF, pad with zeros. */
- if (laiocb->is_read) {
+ if (laiocb->type == QEMU_AIO_READ) {
qemu_iovec_memset(laiocb->qiov, ret, 0,
laiocb->qiov->size - ret);
} else {
@@ -367,23 +373,23 @@ static void laio_deferred_fn(void *opaque)
}
}
-static int laio_do_submit(int fd, struct qemu_laiocb *laiocb, off_t offset,
- int type, BdrvRequestFlags flags,
- uint64_t dev_max_batch)
+static int laio_do_submit(struct qemu_laiocb *laiocb)
{
LinuxAioState *s = laiocb->ctx;
struct iocb *iocbs = &laiocb->iocb;
QEMUIOVector *qiov = laiocb->qiov;
+ int fd = laiocb->fd;
+ off_t offset = laiocb->offset;
- switch (type) {
+ switch (laiocb->type) {
case QEMU_AIO_WRITE:
#ifdef HAVE_IO_PREP_PWRITEV2
{
- int laio_flags = (flags & BDRV_REQ_FUA) ? RWF_DSYNC : 0;
+ int laio_flags = (laiocb->flags & BDRV_REQ_FUA) ? RWF_DSYNC : 0;
io_prep_pwritev2(iocbs, fd, qiov->iov, qiov->niov, offset, laio_flags);
}
#else
- assert(flags == 0);
+ assert(laiocb->flags == 0);
io_prep_pwritev(iocbs, fd, qiov->iov, qiov->niov, offset);
#endif
break;
@@ -399,7 +405,7 @@ static int laio_do_submit(int fd, struct qemu_laiocb *laiocb, off_t offset,
/* Currently Linux kernel does not support other operations */
default:
fprintf(stderr, "%s: invalid AIO request type 0x%x.\n",
- __func__, type);
+ __func__, laiocb->type);
return -EIO;
}
io_set_eventfd(&laiocb->iocb, event_notifier_get_fd(&s->e));
@@ -407,7 +413,7 @@ static int laio_do_submit(int fd, struct qemu_laiocb *laiocb, off_t offset,
QSIMPLEQ_INSERT_TAIL(&s->io_q.pending, laiocb, next);
s->io_q.in_queue++;
if (!s->io_q.blocked) {
- if (s->io_q.in_queue >= laio_max_batch(s, dev_max_batch)) {
+ if (s->io_q.in_queue >= laio_max_batch(s, laiocb->dev_max_batch)) {
ioq_submit(s);
} else {
defer_call(laio_deferred_fn, s);
@@ -425,14 +431,18 @@ int coroutine_fn laio_co_submit(int fd, uint64_t offset, QEMUIOVector *qiov,
AioContext *ctx = qemu_get_current_aio_context();
struct qemu_laiocb laiocb = {
.co = qemu_coroutine_self(),
+ .offset = offset,
.nbytes = qiov ? qiov->size : 0,
.ctx = aio_get_linux_aio(ctx),
.ret = -EINPROGRESS,
- .is_read = (type == QEMU_AIO_READ),
.qiov = qiov,
+ .fd = fd,
+ .type = type,
+ .flags = flags,
+ .dev_max_batch = dev_max_batch,
};
- ret = laio_do_submit(fd, &laiocb, offset, type, flags, dev_max_batch);
+ ret = laio_do_submit(&laiocb);
if (ret < 0) {
return ret;
}
--
2.53.0
* [PATCH for-11.0 v2 2/3] linux-aio: Resubmit tails of short reads/writes
From: Hanna Czenczek @ 2026-03-24 8:43 UTC
To: qemu-block
Cc: qemu-devel, Hanna Czenczek, Kevin Wolf, Julia Suvorova,
Aarushi Mehta, Stefan Hajnoczi, Stefano Garzarella
Short reads/writes can happen. One way to reproduce them is via our
FUSE export, with the following diff applied (%s/escaped // to apply --
if you put plain diffs in commit messages, git-am will apply them, and I
would rather avoid breaking FUSE accidentally via this patch):
escaped diff --git a/block/export/fuse.c b/block/export/fuse.c
escaped index a2a478d293..67dc50a412 100644
escaped --- a/block/export/fuse.c
escaped +++ b/block/export/fuse.c
@@ -828,7 +828,7 @@ static ssize_t coroutine_fn GRAPH_RDLOCK
fuse_co_init(FuseExport *exp, struct fuse_init_out *out,
const struct fuse_init_in_compat *in)
{
- const uint32_t supported_flags = FUSE_ASYNC_READ | FUSE_ASYNC_DIO;
+ const uint32_t supported_flags = FUSE_ASYNC_READ;
if (in->major != 7) {
error_report("FUSE major version mismatch: We have 7, but kernel has %"
@@ -1060,6 +1060,8 @@ fuse_co_read(FuseExport *exp, void **bufptr, uint64_t offset, uint32_t size)
void *buf;
int ret;
+ size = MIN(size, 4096);
+
/* Limited by max_read, should not happen */
if (size > FUSE_MAX_READ_BYTES) {
return -EINVAL;
@@ -1110,6 +1112,8 @@ fuse_co_write(FuseExport *exp, struct fuse_write_out *out,
int64_t blk_len;
int ret;
+ size = MIN(size, 4096);
+
QEMU_BUILD_BUG_ON(FUSE_MAX_WRITE_BYTES > BDRV_REQUEST_MAX_BYTES);
/* Limited by max_write, should not happen */
if (size > FUSE_MAX_WRITE_BYTES) {
Then:
$ ./qemu-img create -f raw test.raw 8k
Formatting 'test.raw', fmt=raw size=8192
$ ./qemu-io -f raw -c 'write -P 42 0 8k' test.raw
wrote 8192/8192 bytes at offset 0
8 KiB, 1 ops; 00.00 sec (64.804 MiB/sec and 8294.9003 ops/sec)
$ hexdump -C test.raw
00000000 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a |****************|
*
00002000
With aio=threads, short I/O works:
$ storage-daemon/qemu-storage-daemon \
--blockdev file,node-name=test,filename=test.raw \
--export fuse,id=exp,node-name=test,mountpoint=test.raw,writable=true
Other shell:
$ ./qemu-io --image-opts -c 'read -P 42 0 8k' \
driver=file,filename=test.raw,cache.direct=on,aio=threads
read 8192/8192 bytes at offset 0
8 KiB, 1 ops; 00.00 sec (36.563 MiB/sec and 4680.0923 ops/sec)
$ ./qemu-io --image-opts -c 'write -P 23 0 8k' \
driver=file,filename=test.raw,cache.direct=on,aio=threads
wrote 8192/8192 bytes at offset 0
8 KiB, 1 ops; 00.00 sec (35.995 MiB/sec and 4607.2970 ops/sec)
$ hexdump -C test.raw
00000000 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 |................|
*
00002000
But with aio=native, it does not:
$ ./qemu-io --image-opts -c 'read -P 23 0 8k' \
driver=file,filename=test.raw,cache.direct=on,aio=native
Pattern verification failed at offset 0, 8192 bytes
read 8192/8192 bytes at offset 0
8 KiB, 1 ops; 00.00 sec (86.155 MiB/sec and 11027.7900 ops/sec)
$ ./qemu-io --image-opts -c 'write -P 42 0 8k' \
driver=file,filename=test.raw,cache.direct=on,aio=native
write failed: No space left on device
$ hexdump -C test.raw
00000000 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a |****************|
*
00001000 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 |................|
*
00002000
This patch fixes that.
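For readers unfamiliar with the I/O vector handling: resubmitting a tail means building a new vector that skips the bytes already transferred, which the patch does with qemu_iovec_concat(). The standalone sketch below shows the same idea on plain `struct iovec`; `iov_tail()` is a hypothetical helper for illustration, not a QEMU function.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>

/*
 * Fill dst with the part of src[0..niov) that follows the first `skip`
 * bytes. dst must have room for niov entries. Returns the number of
 * entries written to dst.
 */
static int iov_tail(struct iovec *dst, const struct iovec *src, int niov,
                    size_t skip)
{
    int n = 0;

    for (int i = 0; i < niov; i++) {
        if (skip >= src[i].iov_len) {
            /* This element was transferred completely: drop it. */
            skip -= src[i].iov_len;
            continue;
        }
        /* Partially transferred (or untouched) element: keep the rest. */
        dst[n].iov_base = (char *)src[i].iov_base + skip;
        dst[n].iov_len = src[i].iov_len - skip;
        skip = 0;
        n++;
    }
    return n;
}
```

After a short transfer of `done` bytes, the tail would be submitted with the vector from `iov_tail(dst, src, niov, total_done)` at file offset `offset + total_done`, mirroring what laio_do_submit() does with resubmit_qiov.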
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
block/linux-aio.c | 56 ++++++++++++++++++++++++++++++++++++++++++-----
1 file changed, 50 insertions(+), 6 deletions(-)
diff --git a/block/linux-aio.c b/block/linux-aio.c
index 3843f45eac..0a7424fbb3 100644
--- a/block/linux-aio.c
+++ b/block/linux-aio.c
@@ -45,6 +45,10 @@ struct qemu_laiocb {
size_t nbytes;
QEMUIOVector *qiov;
+ /* For handling short reads/writes */
+ size_t total_done;
+ QEMUIOVector resubmit_qiov;
+
int fd;
int type;
BdrvRequestFlags flags;
@@ -74,28 +78,61 @@ struct LinuxAioState {
};
static void ioq_submit(LinuxAioState *s);
+static int laio_do_submit(struct qemu_laiocb *laiocb);
static inline ssize_t io_event_ret(struct io_event *ev)
{
return (ssize_t)(((uint64_t)ev->res2 << 32) | ev->res);
}
+/**
+ * Retry tail of short requests.
+ */
+static int laio_resubmit_short_io(struct qemu_laiocb *laiocb, size_t done)
+{
+ QEMUIOVector *resubmit_qiov = &laiocb->resubmit_qiov;
+
+ laiocb->total_done += done;
+
+ if (!resubmit_qiov->iov) {
+ qemu_iovec_init(resubmit_qiov, laiocb->qiov->niov);
+ } else {
+ qemu_iovec_reset(resubmit_qiov);
+ }
+ qemu_iovec_concat(resubmit_qiov, laiocb->qiov,
+ laiocb->total_done, laiocb->nbytes - laiocb->total_done);
+
+ return laio_do_submit(laiocb);
+}
+
/*
* Completes an AIO request.
*/
static void qemu_laio_process_completion(struct qemu_laiocb *laiocb)
{
- int ret;
+ ssize_t ret;
ret = laiocb->ret;
if (ret != -ECANCELED) {
- if (ret == laiocb->nbytes) {
+ if (ret == laiocb->nbytes - laiocb->total_done) {
ret = 0;
+ } else if (ret > 0 && (laiocb->type == QEMU_AIO_READ ||
+ laiocb->type == QEMU_AIO_WRITE)) {
+ ret = laio_resubmit_short_io(laiocb, ret);
+ if (!ret) {
+ return;
+ }
} else if (ret >= 0) {
- /* Short reads mean EOF, pad with zeros. */
+ /*
+ * For normal reads and writes, we only get here if ret == 0, which
+ * means EOF for reads and ENOSPC for writes.
+ * For zone-append, we get here with any ret >= 0, which we just
+ * treat as ENOSPC, too (safer than resubmitting, probably, but not
+ * 100 % clear).
+ */
if (laiocb->type == QEMU_AIO_READ) {
- qemu_iovec_memset(laiocb->qiov, ret, 0,
- laiocb->qiov->size - ret);
+ qemu_iovec_memset(laiocb->qiov, laiocb->total_done, 0,
+ laiocb->qiov->size - laiocb->total_done);
} else {
ret = -ENOSPC;
}
@@ -103,6 +140,9 @@ static void qemu_laio_process_completion(struct qemu_laiocb *laiocb)
}
laiocb->ret = ret;
+ if (laiocb->resubmit_qiov.iov) {
+ qemu_iovec_destroy(&laiocb->resubmit_qiov);
+ }
/*
* If the coroutine is already entered it must be in ioq_submit() and
@@ -379,7 +419,11 @@ static int laio_do_submit(struct qemu_laiocb *laiocb)
struct iocb *iocbs = &laiocb->iocb;
QEMUIOVector *qiov = laiocb->qiov;
int fd = laiocb->fd;
- off_t offset = laiocb->offset;
+ off_t offset = laiocb->offset + laiocb->total_done;
+
+ if (laiocb->resubmit_qiov.iov) {
+ qiov = &laiocb->resubmit_qiov;
+ }
switch (laiocb->type) {
case QEMU_AIO_WRITE:
--
2.53.0
* [PATCH for-11.0 v2 3/3] io-uring: Resubmit tails of short writes
From: Hanna Czenczek @ 2026-03-24 8:43 UTC
To: qemu-block
Cc: qemu-devel, Hanna Czenczek, Kevin Wolf, Julia Suvorova,
Aarushi Mehta, Stefan Hajnoczi, Stefano Garzarella
Short writes can happen, too, not just short reads. The difference from
aio=native is that the kernel already retries the tail of short requests
internally, so short writes are harder to reproduce. But if the tail of a
short request returns an error to the kernel, we still see the short
request in userspace. To reproduce this, apply the following patch on top
of the one shown in HEAD^ (again %s/escaped // to apply):
escaped diff --git a/block/export/fuse.c b/block/export/fuse.c
escaped index 67dc50a412..2b98489a32 100644
escaped --- a/block/export/fuse.c
escaped +++ b/block/export/fuse.c
@@ -1059,8 +1059,15 @@ fuse_co_read(FuseExport *exp, void **bufptr, uint64_t offset, uint32_t size)
int64_t blk_len;
void *buf;
int ret;
+ static uint32_t error_size;
- size = MIN(size, 4096);
+ if (error_size == size) {
+ error_size = 0;
+ return -EIO;
+ } else if (size > 4096) {
+ error_size = size - 4096;
+ size = 4096;
+ }
/* Limited by max_read, should not happen */
if (size > FUSE_MAX_READ_BYTES) {
@@ -1111,8 +1118,15 @@ fuse_co_write(FuseExport *exp, struct fuse_write_out *out,
{
int64_t blk_len;
int ret;
+ static uint32_t error_size;
- size = MIN(size, 4096);
+ if (error_size == size) {
+ error_size = 0;
+ return -EIO;
+ } else if (size > 4096) {
+ error_size = size - 4096;
+ size = 4096;
+ }
QEMU_BUILD_BUG_ON(FUSE_MAX_WRITE_BYTES > BDRV_REQUEST_MAX_BYTES);
/* Limited by max_write, should not happen */
I know this is a bit artificial because to produce this, there must be
an I/O error somewhere anyway, but if it does happen, qemu will
understand it to mean ENOSPC for short writes, which is incorrect. So I
believe we need to resubmit the tail to maybe have it succeed now, or at
least get the correct error code.
Reproducer as before:
$ ./qemu-img create -f raw test.raw 8k
Formatting 'test.raw', fmt=raw size=8192
$ ./qemu-io -f raw -c 'write -P 42 0 8k' test.raw
wrote 8192/8192 bytes at offset 0
8 KiB, 1 ops; 00.00 sec (64.804 MiB/sec and 8294.9003 ops/sec)
$ hexdump -C test.raw
00000000 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a |****************|
*
00002000
$ storage-daemon/qemu-storage-daemon \
--blockdev file,node-name=test,filename=test.raw \
--export fuse,id=exp,node-name=test,mountpoint=test.raw,writable=true
$ ./qemu-io --image-opts -c 'read -P 23 0 8k' \
driver=file,filename=test.raw,cache.direct=on,aio=io_uring
read 8192/8192 bytes at offset 0
8 KiB, 1 ops; 00.00 sec (58.481 MiB/sec and 7485.5342 ops/sec)
$ ./qemu-io --image-opts -c 'write -P 23 0 8k' \
driver=file,filename=test.raw,cache.direct=on,aio=io_uring
write failed: No space left on device
$ hexdump -C test.raw
00000000 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 |................|
*
00001000 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a |****************|
*
00002000
So short reads already work (because there is code for that), but short
writes incorrectly produce ENOSPC. This patch fixes that by
resubmitting the tail not only of short reads but also of short writes.
(This patch also takes the opportunity to call qemu_iovec_destroy() only
if req->resubmit_qiov.iov is non-NULL. Functionally this is a no-op, but
it is how the code generally checks whether the resubmit_qiov has been
set up.)
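The completion-time decision described above can be summarized in a small standalone sketch. It mirrors the branch structure of the completion handlers in this series but is not the QEMU code itself: `classify()`, `enum req_type`, and `enum action` are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical stand-ins for the QEMU_AIO_* request types. */
enum req_type { REQ_READ, REQ_WRITE, REQ_ZONE_APPEND };

enum action {
    ACT_DONE,       /* request fully completed */
    ACT_RESUBMIT,   /* short read/write with progress: resubmit the tail */
    ACT_PAD_EOF,    /* read returned 0: EOF, pad the rest with zeroes */
    ACT_FAIL,       /* fail the request with *err */
};

/*
 * Decide what to do with a completion of `ret` bytes for a request of
 * `nbytes` total, of which `total_done` bytes were finished earlier.
 */
static enum action classify(enum req_type type, ssize_t ret,
                            size_t total_done, size_t nbytes, int *err)
{
    *err = 0;
    if (ret < 0) {
        *err = (int)ret;            /* pass the real error through */
        return ACT_FAIL;
    }
    if ((size_t)ret == nbytes - total_done) {
        return ACT_DONE;
    }
    if (ret > 0 && (type == REQ_READ || type == REQ_WRITE)) {
        return ACT_RESUBMIT;
    }
    if (type == REQ_READ) {
        return ACT_PAD_EOF;         /* only reached with ret == 0 */
    }
    /* Write with ret == 0, or any short zone-append: treated as ENOSPC. */
    *err = -ENOSPC;
    return ACT_FAIL;
}
```

The point of the sketch is the ordering: a short write with progress is resubmitted rather than mapped to ENOSPC, so an error on the tail (for example EIO) reaches the caller unchanged.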
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
block/io_uring.c | 82 +++++++++++++++++++++++++---------------------
block/trace-events | 2 +-
2 files changed, 46 insertions(+), 38 deletions(-)
diff --git a/block/io_uring.c b/block/io_uring.c
index cb131d3b8b..c48a72d37e 100644
--- a/block/io_uring.c
+++ b/block/io_uring.c
@@ -27,10 +27,10 @@ typedef struct {
BdrvRequestFlags flags;
/*
- * Buffered reads may require resubmission, see
- * luring_resubmit_short_read().
+ * Short reads/writes require resubmission, see
+ * luring_resubmit_short_io().
*/
- int total_read;
+ int total_done;
QEMUIOVector resubmit_qiov;
CqeHandler cqe_handler;
@@ -40,10 +40,14 @@ static void luring_prep_sqe(struct io_uring_sqe *sqe, void *opaque)
{
LuringRequest *req = opaque;
QEMUIOVector *qiov = req->qiov;
- uint64_t offset = req->offset;
+ uint64_t offset = req->offset + req->total_done;
int fd = req->fd;
BdrvRequestFlags flags = req->flags;
+ if (req->resubmit_qiov.iov) {
+ qiov = &req->resubmit_qiov;
+ }
+
switch (req->type) {
case QEMU_AIO_WRITE:
{
@@ -73,17 +77,12 @@ static void luring_prep_sqe(struct io_uring_sqe *sqe, void *opaque)
break;
case QEMU_AIO_READ:
{
- if (req->resubmit_qiov.iov != NULL) {
- qiov = &req->resubmit_qiov;
- }
if (qiov->niov > 1) {
- io_uring_prep_readv(sqe, fd, qiov->iov, qiov->niov,
- offset + req->total_read);
+ io_uring_prep_readv(sqe, fd, qiov->iov, qiov->niov, offset);
} else {
/* The man page says non-vectored is faster than vectored */
struct iovec *iov = qiov->iov;
- io_uring_prep_read(sqe, fd, iov->iov_base, iov->iov_len,
- offset + req->total_read);
+ io_uring_prep_read(sqe, fd, iov->iov_base, iov->iov_len, offset);
}
break;
}
@@ -98,21 +97,26 @@ static void luring_prep_sqe(struct io_uring_sqe *sqe, void *opaque)
}
/**
- * luring_resubmit_short_read:
+ * luring_resubmit_short_io:
*
- * Short reads are rare but may occur. The remaining read request needs to be
- * resubmitted.
+ * Short reads and writes are rare but may occur. The remaining request needs
+ * to be resubmitted.
+ *
+ * For example, short reads can be reproduced by a FUSE export deliberately
+ * executing short reads. The tail of short writes is generally resubmitted by
+ * io-uring in the kernel, but if that resubmission encounters an I/O error, the
+ * already submitted portion will be returned as a short write.
*/
-static void luring_resubmit_short_read(LuringRequest *req, int nread)
+static void luring_resubmit_short_io(LuringRequest *req, int ndone)
{
QEMUIOVector *resubmit_qiov;
size_t remaining;
- trace_luring_resubmit_short_read(req, nread);
+ trace_luring_resubmit_short_io(req, ndone);
- /* Update read position */
- req->total_read += nread;
- remaining = req->qiov->size - req->total_read;
+ /* Update I/O position */
+ req->total_done += ndone;
+ remaining = req->qiov->size - req->total_done;
/* Shorten qiov */
resubmit_qiov = &req->resubmit_qiov;
@@ -121,7 +125,7 @@ static void luring_resubmit_short_read(LuringRequest *req, int nread)
} else {
qemu_iovec_reset(resubmit_qiov);
}
- qemu_iovec_concat(resubmit_qiov, req->qiov, req->total_read, remaining);
+ qemu_iovec_concat(resubmit_qiov, req->qiov, req->total_done, remaining);
aio_add_sqe(luring_prep_sqe, req, &req->cqe_handler);
}
@@ -153,31 +157,35 @@ static void luring_cqe_handler(CqeHandler *cqe_handler)
return;
}
} else if (req->qiov) {
- /* total_read is non-zero only for resubmitted read requests */
- int total_bytes = ret + req->total_read;
+ /* total_done is non-zero only for resubmitted requests */
+ int total_bytes = ret + req->total_done;
if (total_bytes == req->qiov->size) {
ret = 0;
- } else {
+ } else if (ret > 0 && (req->type == QEMU_AIO_READ ||
+ req->type == QEMU_AIO_WRITE)) {
/* Short Read/Write */
- if (req->type == QEMU_AIO_READ) {
- if (ret > 0) {
- luring_resubmit_short_read(req, ret);
- return;
- }
-
- /* Pad with zeroes */
- qemu_iovec_memset(req->qiov, total_bytes, 0,
- req->qiov->size - total_bytes);
- ret = 0;
- } else {
- ret = -ENOSPC;
- }
+ luring_resubmit_short_io(req, ret);
+ return;
+ } else if (req->type == QEMU_AIO_READ) {
+ /* Read ret == 0: EOF, pad with zeroes */
+ qemu_iovec_memset(req->qiov, total_bytes, 0,
+ req->qiov->size - total_bytes);
+ ret = 0;
+ } else {
+ /*
+ * Normal write ret == 0 means ENOSPC.
+ * For zone-append, we treat any 0 <= ret < qiov->size as ENOSPC,
+ * too, because resubmitting the tail seems a little unsafe.
+ */
+ ret = -ENOSPC;
}
}
req->ret = ret;
- qemu_iovec_destroy(&req->resubmit_qiov);
+ if (req->resubmit_qiov.iov) {
+ qemu_iovec_destroy(&req->resubmit_qiov);
+ }
/*
* If the coroutine is already entered it must be in luring_co_submit() and
diff --git a/block/trace-events b/block/trace-events
index d170fc96f1..950c82d4b8 100644
--- a/block/trace-events
+++ b/block/trace-events
@@ -64,7 +64,7 @@ file_paio_submit(void *acb, void *opaque, int64_t offset, int count, int type) "
# io_uring.c
luring_cqe_handler(void *req, int ret) "req %p ret %d"
luring_co_submit(void *bs, void *req, int fd, uint64_t offset, size_t nbytes, int type) "bs %p req %p fd %d offset %" PRId64 " nbytes %zd type %d"
-luring_resubmit_short_read(void *req, int nread) "req %p nread %d"
+luring_resubmit_short_io(void *req, int ndone) "req %p ndone %d"
# qcow2.c
qcow2_add_task(void *co, void *bs, void *pool, const char *action, int cluster_type, uint64_t host_offset, uint64_t offset, uint64_t bytes, void *qiov, size_t qiov_offset) "co %p bs %p pool %p: %s: cluster_type %d file_cluster_offset %" PRIu64 " offset %" PRIu64 " bytes %" PRIu64 " qiov %p qiov_offset %zu"
--
2.53.0
* Re: [PATCH for-11.0 v2 0/3] linux-aio/io-uring: Resubmit tails of short requests
From: Kevin Wolf @ 2026-03-24 16:15 UTC
To: Hanna Czenczek
Cc: qemu-block, qemu-devel, Aarushi Mehta, Stefan Hajnoczi,
Stefano Garzarella
On 24.03.2026 at 09:43, Hanna Czenczek wrote:
> Hi,
>
> v1 is here:
>
> https://lists.nongnu.org/archive/html/qemu-block/2026-03/msg00307.html
>
> Short reads and writes can happen. One way to reproduce them is via
> FUSE export, if you force it to limit the request length in the
> read/write path (patch in the commit messages of patches 2 and 3), but
> specifically short writes apparently can also happen with NFS.
>
> For the file-posix block driver, aio=threads already takes care of them.
> aio=native does not, at all, and aio=io_uring only handles short reads,
> but not writes. This series has both aio=native and aio=io_uring handle
> both short reads and writes. zone-append is not touched, as I don’t
> believe resubmitting the tail (if a short append can even happen) is
> safe.
>
> v2:
> - Patch 1 (kept R-b):
> - Put all 32-bit fields together
> - Removed unnecessary parentheses
> - Patch 2 (kept R-b):
> - make qemu_iovec_destroy() call contingent on qiov.iov being non-NULL
> - include total_done in offset in laio_do_submit()
> - Patch 3 (kept R-b):
> - make qemu_iovec_destroy() call contingent on qiov.iov being non-NULL
> - include total_done in offset in luring_prep_sqe()
Thanks, applied to the block branch.
Kevin