* [PATCH for-11.0 0/3] linux-aio/io-uring: Resubmit tails of short requests
@ 2026-03-18 15:32 Hanna Czenczek
2026-03-18 15:32 ` [PATCH for-11.0 1/3] linux-aio: Put all parameters into qemu_laiocb Hanna Czenczek
` (3 more replies)
0 siblings, 4 replies; 16+ messages in thread
From: Hanna Czenczek @ 2026-03-18 15:32 UTC (permalink / raw)
To: qemu-block
Cc: qemu-devel, Hanna Czenczek, Kevin Wolf, Julia Suvorova,
Aarushi Mehta, Stefan Hajnoczi, Stefano Garzarella
Hi,
Short reads and writes can happen. One way to reproduce them is via
FUSE export, if you force it to limit the request length in the
read/write path (patch in the commit messages of patches 2 and 3), but
specifically short writes apparently can also happen with NFS.
For the file-posix block driver, aio=threads already takes care of them.
aio=native does not, at all, and aio=io_uring only handles short reads,
but not writes. This series has both aio=native and aio=io_uring handle
both short reads and writes. zone-append is not touched, as I don’t
believe resubmitting the tail (if a short append can even happen) is
safe.
Hanna Czenczek (3):
linux-aio: Put all parameters into qemu_laiocb
linux-aio: Resubmit tails of short reads/writes
io-uring: Resubmit tails of short writes
block/io_uring.c | 83 ++++++++++++++++++++++-----------------
block/linux-aio.c | 96 ++++++++++++++++++++++++++++++++++++----------
block/trace-events | 2 +-
3 files changed, 123 insertions(+), 58 deletions(-)
--
2.53.0
^ permalink raw reply [flat|nested] 16+ messages in thread
* [PATCH for-11.0 1/3] linux-aio: Put all parameters into qemu_laiocb
2026-03-18 15:32 [PATCH for-11.0 0/3] linux-aio/io-uring: Resubmit tails of short requests Hanna Czenczek
@ 2026-03-18 15:32 ` Hanna Czenczek
2026-03-23 16:36 ` Kevin Wolf
2026-03-18 15:32 ` [PATCH for-11.0 2/3] linux-aio: Resubmit tails of short reads/writes Hanna Czenczek
` (2 subsequent siblings)
3 siblings, 1 reply; 16+ messages in thread
From: Hanna Czenczek @ 2026-03-18 15:32 UTC (permalink / raw)
To: qemu-block
Cc: qemu-devel, Hanna Czenczek, Kevin Wolf, Julia Suvorova,
Aarushi Mehta, Stefan Hajnoczi, Stefano Garzarella
Put all request parameters into the qemu_laiocb struct, which will allow
re-submitting the tail of short reads/writes.
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
block/linux-aio.c | 35 ++++++++++++++++++++++-------------
1 file changed, 22 insertions(+), 13 deletions(-)
diff --git a/block/linux-aio.c b/block/linux-aio.c
index 53c3e9af8a..1f25339dc9 100644
--- a/block/linux-aio.c
+++ b/block/linux-aio.c
@@ -40,10 +40,15 @@ struct qemu_laiocb {
Coroutine *co;
LinuxAioState *ctx;
struct iocb iocb;
+ int fd;
ssize_t ret;
+ off_t offset;
size_t nbytes;
QEMUIOVector *qiov;
- bool is_read;
+
+ int type;
+ BdrvRequestFlags flags;
+ uint64_t dev_max_batch;
QSIMPLEQ_ENTRY(qemu_laiocb) next;
};
@@ -87,7 +92,7 @@ static void qemu_laio_process_completion(struct qemu_laiocb *laiocb)
ret = 0;
} else if (ret >= 0) {
/* Short reads mean EOF, pad with zeros. */
- if (laiocb->is_read) {
+ if (laiocb->type == QEMU_AIO_READ) {
qemu_iovec_memset(laiocb->qiov, ret, 0,
laiocb->qiov->size - ret);
} else {
@@ -367,23 +372,23 @@ static void laio_deferred_fn(void *opaque)
}
}
-static int laio_do_submit(int fd, struct qemu_laiocb *laiocb, off_t offset,
- int type, BdrvRequestFlags flags,
- uint64_t dev_max_batch)
+static int laio_do_submit(struct qemu_laiocb *laiocb)
{
LinuxAioState *s = laiocb->ctx;
struct iocb *iocbs = &laiocb->iocb;
QEMUIOVector *qiov = laiocb->qiov;
+ int fd = laiocb->fd;
+ off_t offset = laiocb->offset;
- switch (type) {
+ switch (laiocb->type) {
case QEMU_AIO_WRITE:
#ifdef HAVE_IO_PREP_PWRITEV2
{
- int laio_flags = (flags & BDRV_REQ_FUA) ? RWF_DSYNC : 0;
+ int laio_flags = (laiocb->flags & BDRV_REQ_FUA) ? RWF_DSYNC : 0;
io_prep_pwritev2(iocbs, fd, qiov->iov, qiov->niov, offset, laio_flags);
}
#else
- assert(flags == 0);
+ assert(laiocb->flags == 0);
io_prep_pwritev(iocbs, fd, qiov->iov, qiov->niov, offset);
#endif
break;
@@ -399,7 +404,7 @@ static int laio_do_submit(int fd, struct qemu_laiocb *laiocb, off_t offset,
/* Currently Linux kernel does not support other operations */
default:
fprintf(stderr, "%s: invalid AIO request type 0x%x.\n",
- __func__, type);
+ __func__, laiocb->type);
return -EIO;
}
io_set_eventfd(&laiocb->iocb, event_notifier_get_fd(&s->e));
@@ -407,7 +412,7 @@ static int laio_do_submit(int fd, struct qemu_laiocb *laiocb, off_t offset,
QSIMPLEQ_INSERT_TAIL(&s->io_q.pending, laiocb, next);
s->io_q.in_queue++;
if (!s->io_q.blocked) {
- if (s->io_q.in_queue >= laio_max_batch(s, dev_max_batch)) {
+ if (s->io_q.in_queue >= laio_max_batch(s, laiocb->dev_max_batch)) {
ioq_submit(s);
} else {
defer_call(laio_deferred_fn, s);
@@ -425,14 +430,18 @@ int coroutine_fn laio_co_submit(int fd, uint64_t offset, QEMUIOVector *qiov,
AioContext *ctx = qemu_get_current_aio_context();
struct qemu_laiocb laiocb = {
.co = qemu_coroutine_self(),
- .nbytes = qiov ? qiov->size : 0,
+ .fd = fd,
+ .offset = offset,
+ .nbytes = (qiov ? qiov->size : 0),
.ctx = aio_get_linux_aio(ctx),
.ret = -EINPROGRESS,
- .is_read = (type == QEMU_AIO_READ),
.qiov = qiov,
+ .type = type,
+ .flags = flags,
+ .dev_max_batch = dev_max_batch,
};
- ret = laio_do_submit(fd, &laiocb, offset, type, flags, dev_max_batch);
+ ret = laio_do_submit(&laiocb);
if (ret < 0) {
return ret;
}
--
2.53.0
^ permalink raw reply related [flat|nested] 16+ messages in thread
* [PATCH for-11.0 2/3] linux-aio: Resubmit tails of short reads/writes
2026-03-18 15:32 [PATCH for-11.0 0/3] linux-aio/io-uring: Resubmit tails of short requests Hanna Czenczek
2026-03-18 15:32 ` [PATCH for-11.0 1/3] linux-aio: Put all parameters into qemu_laiocb Hanna Czenczek
@ 2026-03-18 15:32 ` Hanna Czenczek
2026-03-23 17:12 ` Kevin Wolf
2026-03-18 15:32 ` [PATCH for-11.0 3/3] io-uring: Resubmit tails of short writes Hanna Czenczek
2026-03-23 16:28 ` [PATCH for-11.0 0/3] linux-aio/io-uring: Resubmit tails of short requests Kevin Wolf
3 siblings, 1 reply; 16+ messages in thread
From: Hanna Czenczek @ 2026-03-18 15:32 UTC (permalink / raw)
To: qemu-block
Cc: qemu-devel, Hanna Czenczek, Kevin Wolf, Julia Suvorova,
Aarushi Mehta, Stefan Hajnoczi, Stefano Garzarella
Short reads/writes can happen. One way to reproduce them is via our
FUSE export, with the following diff applied (%s/escaped // to apply --
if you put plain diffs in commit messages, git-am will apply them, and I
would rather avoid breaking FUSE accidentally via this patch):
escaped diff --git a/block/export/fuse.c b/block/export/fuse.c
escaped index a2a478d293..67dc50a412 100644
escaped --- a/block/export/fuse.c
escaped +++ b/block/export/fuse.c
@@ -828,7 +828,7 @@ static ssize_t coroutine_fn GRAPH_RDLOCK
fuse_co_init(FuseExport *exp, struct fuse_init_out *out,
const struct fuse_init_in_compat *in)
{
- const uint32_t supported_flags = FUSE_ASYNC_READ | FUSE_ASYNC_DIO;
+ const uint32_t supported_flags = FUSE_ASYNC_READ;
if (in->major != 7) {
error_report("FUSE major version mismatch: We have 7, but kernel has %"
@@ -1060,6 +1060,8 @@ fuse_co_read(FuseExport *exp, void **bufptr, uint64_t offset, uint32_t size)
void *buf;
int ret;
+ size = MIN(size, 4096);
+
/* Limited by max_read, should not happen */
if (size > FUSE_MAX_READ_BYTES) {
return -EINVAL;
@@ -1110,6 +1112,8 @@ fuse_co_write(FuseExport *exp, struct fuse_write_out *out,
int64_t blk_len;
int ret;
+ size = MIN(size, 4096);
+
QEMU_BUILD_BUG_ON(FUSE_MAX_WRITE_BYTES > BDRV_REQUEST_MAX_BYTES);
/* Limited by max_write, should not happen */
if (size > FUSE_MAX_WRITE_BYTES) {
Then:
$ ./qemu-img create -f raw test.raw 8k
Formatting 'test.raw', fmt=raw size=8192
$ ./qemu-io -f raw -c 'write -P 42 0 8k' test.raw
wrote 8192/8192 bytes at offset 0
8 KiB, 1 ops; 00.00 sec (64.804 MiB/sec and 8294.9003 ops/sec)
$ hexdump -C test.raw
00000000 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a |****************|
*
00002000
With aio=threads, short I/O works:
$ storage-daemon/qemu-storage-daemon \
--blockdev file,node-name=test,filename=test.raw \
--export fuse,id=exp,node-name=test,mountpoint=test.raw,writable=true
Other shell:
$ ./qemu-io --image-opts -c 'read -P 42 0 8k' \
driver=file,filename=test.raw,cache.direct=on,aio=threads
read 8192/8192 bytes at offset 0
8 KiB, 1 ops; 00.00 sec (36.563 MiB/sec and 4680.0923 ops/sec)
$ ./qemu-io --image-opts -c 'write -P 23 0 8k' \
driver=file,filename=test.raw,cache.direct=on,aio=threads
wrote 8192/8192 bytes at offset 0
8 KiB, 1 ops; 00.00 sec (35.995 MiB/sec and 4607.2970 ops/sec)
$ hexdump -C test.raw
00000000 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 |................|
*
00002000
But with aio=native, it does not:
$ ./qemu-io --image-opts -c 'read -P 23 0 8k' \
driver=file,filename=test.raw,cache.direct=on,aio=native
Pattern verification failed at offset 0, 8192 bytes
read 8192/8192 bytes at offset 0
8 KiB, 1 ops; 00.00 sec (86.155 MiB/sec and 11027.7900 ops/sec)
$ ./qemu-io --image-opts -c 'write -P 42 0 8k' \
driver=file,filename=test.raw,cache.direct=on,aio=native
write failed: No space left on device
$ hexdump -C test.raw
00000000 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a |****************|
*
00001000 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 |................|
*
00002000
This patch fixes that.
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
block/linux-aio.c | 61 ++++++++++++++++++++++++++++++++++++++++-------
1 file changed, 53 insertions(+), 8 deletions(-)
diff --git a/block/linux-aio.c b/block/linux-aio.c
index 1f25339dc9..01621d4794 100644
--- a/block/linux-aio.c
+++ b/block/linux-aio.c
@@ -46,6 +46,10 @@ struct qemu_laiocb {
size_t nbytes;
QEMUIOVector *qiov;
+ /* For handling short reads/writes */
+ size_t total_done;
+ QEMUIOVector resubmit_qiov;
+
int type;
BdrvRequestFlags flags;
uint64_t dev_max_batch;
@@ -73,28 +77,61 @@ struct LinuxAioState {
};
static void ioq_submit(LinuxAioState *s);
+static int laio_do_submit(struct qemu_laiocb *laiocb);
static inline ssize_t io_event_ret(struct io_event *ev)
{
return (ssize_t)(((uint64_t)ev->res2 << 32) | ev->res);
}
+/**
+ * Retry tail of short requests.
+ */
+static int laio_resubmit_short_io(struct qemu_laiocb *laiocb, size_t done)
+{
+ QEMUIOVector *resubmit_qiov = &laiocb->resubmit_qiov;
+
+ laiocb->total_done += done;
+
+ if (!resubmit_qiov->iov) {
+ qemu_iovec_init(resubmit_qiov, laiocb->qiov->niov);
+ } else {
+ qemu_iovec_reset(resubmit_qiov);
+ }
+ qemu_iovec_concat(resubmit_qiov, laiocb->qiov,
+ laiocb->total_done, laiocb->nbytes - laiocb->total_done);
+
+ return laio_do_submit(laiocb);
+}
+
/*
* Completes an AIO request.
*/
static void qemu_laio_process_completion(struct qemu_laiocb *laiocb)
{
- int ret;
+ ssize_t ret;
ret = laiocb->ret;
if (ret != -ECANCELED) {
- if (ret == laiocb->nbytes) {
+ if (ret == laiocb->nbytes - laiocb->total_done) {
ret = 0;
+ } else if (ret > 0 && (laiocb->type == QEMU_AIO_READ ||
+ laiocb->type == QEMU_AIO_WRITE)) {
+ ret = laio_resubmit_short_io(laiocb, ret);
+ if (!ret) {
+ return;
+ }
} else if (ret >= 0) {
- /* Short reads mean EOF, pad with zeros. */
+ /*
+ * For normal reads and writes, we only get here if ret == 0, which
+ * means EOF for reads and ENOSPC for writes.
+ * For zone-append, we get here with any ret >= 0, which we just
+ * treat as ENOSPC, too (safer than resubmitting, probably, but not
+ * 100 % clear).
+ */
if (laiocb->type == QEMU_AIO_READ) {
- qemu_iovec_memset(laiocb->qiov, ret, 0,
- laiocb->qiov->size - ret);
+ qemu_iovec_memset(laiocb->qiov, laiocb->total_done, 0,
+ laiocb->qiov->size - laiocb->total_done);
} else {
ret = -ENOSPC;
}
@@ -102,6 +139,7 @@ static void qemu_laio_process_completion(struct qemu_laiocb *laiocb)
}
laiocb->ret = ret;
+ qemu_iovec_destroy(&laiocb->resubmit_qiov);
/*
* If the coroutine is already entered it must be in ioq_submit() and
@@ -380,23 +418,30 @@ static int laio_do_submit(struct qemu_laiocb *laiocb)
int fd = laiocb->fd;
off_t offset = laiocb->offset;
+ if (laiocb->resubmit_qiov.iov) {
+ qiov = &laiocb->resubmit_qiov;
+ }
+
switch (laiocb->type) {
case QEMU_AIO_WRITE:
#ifdef HAVE_IO_PREP_PWRITEV2
{
int laio_flags = (laiocb->flags & BDRV_REQ_FUA) ? RWF_DSYNC : 0;
- io_prep_pwritev2(iocbs, fd, qiov->iov, qiov->niov, offset, laio_flags);
+ io_prep_pwritev2(iocbs, fd, qiov->iov, qiov->niov,
+ offset + laiocb->total_done, laio_flags);
}
#else
assert(laiocb->flags == 0);
- io_prep_pwritev(iocbs, fd, qiov->iov, qiov->niov, offset);
+ io_prep_pwritev(iocbs, fd, qiov->iov, qiov->niov,
+ offset + laiocb->total_done);
#endif
break;
case QEMU_AIO_ZONE_APPEND:
io_prep_pwritev(iocbs, fd, qiov->iov, qiov->niov, offset);
break;
case QEMU_AIO_READ:
- io_prep_preadv(iocbs, fd, qiov->iov, qiov->niov, offset);
+ io_prep_preadv(iocbs, fd, qiov->iov, qiov->niov,
+ offset + laiocb->total_done);
break;
case QEMU_AIO_FLUSH:
io_prep_fdsync(iocbs, fd);
--
2.53.0
^ permalink raw reply related [flat|nested] 16+ messages in thread
* [PATCH for-11.0 3/3] io-uring: Resubmit tails of short writes
2026-03-18 15:32 [PATCH for-11.0 0/3] linux-aio/io-uring: Resubmit tails of short requests Hanna Czenczek
2026-03-18 15:32 ` [PATCH for-11.0 1/3] linux-aio: Put all parameters into qemu_laiocb Hanna Czenczek
2026-03-18 15:32 ` [PATCH for-11.0 2/3] linux-aio: Resubmit tails of short reads/writes Hanna Czenczek
@ 2026-03-18 15:32 ` Hanna Czenczek
2026-03-23 19:05 ` Kevin Wolf
2026-03-23 16:28 ` [PATCH for-11.0 0/3] linux-aio/io-uring: Resubmit tails of short requests Kevin Wolf
3 siblings, 1 reply; 16+ messages in thread
From: Hanna Czenczek @ 2026-03-18 15:32 UTC (permalink / raw)
To: qemu-block
Cc: qemu-devel, Hanna Czenczek, Kevin Wolf, Julia Suvorova,
Aarushi Mehta, Stefan Hajnoczi, Stefano Garzarella
Short writes can happen, too, not just short reads. The difference to
aio=native is that the kernel will actually retry the tail of short
requests internally already -- so it is harder to reproduce. But if the
tail of a short request returns an error to the kernel, we will see it
in userspace still. To reproduce this, apply the following patch on top
of the one shown in HEAD^ (again %s/escaped // to apply):
escaped diff --git a/block/export/fuse.c b/block/export/fuse.c
escaped index 67dc50a412..2b98489a32 100644
escaped --- a/block/export/fuse.c
escaped +++ b/block/export/fuse.c
@@ -1059,8 +1059,15 @@ fuse_co_read(FuseExport *exp, void **bufptr, uint64_t offset, uint32_t size)
int64_t blk_len;
void *buf;
int ret;
+ static uint32_t error_size;
- size = MIN(size, 4096);
+ if (error_size == size) {
+ error_size = 0;
+ return -EIO;
+ } else if (size > 4096) {
+ error_size = size - 4096;
+ size = 4096;
+ }
/* Limited by max_read, should not happen */
if (size > FUSE_MAX_READ_BYTES) {
@@ -1111,8 +1118,15 @@ fuse_co_write(FuseExport *exp, struct fuse_write_out *out,
{
int64_t blk_len;
int ret;
+ static uint32_t error_size;
- size = MIN(size, 4096);
+ if (error_size == size) {
+ error_size = 0;
+ return -EIO;
+ } else if (size > 4096) {
+ error_size = size - 4096;
+ size = 4096;
+ }
QEMU_BUILD_BUG_ON(FUSE_MAX_WRITE_BYTES > BDRV_REQUEST_MAX_BYTES);
/* Limited by max_write, should not happen */
I know this is a bit artificial because to produce this, there must be
an I/O error somewhere anyway, but if it does happen, qemu will
understand it to mean ENOSPC for short writes, which is incorrect. So I
believe we need to resubmit the tail to maybe have it succeed now, or at
least get the correct error code.
Reproducer as before:
$ ./qemu-img create -f raw test.raw 8k
Formatting 'test.raw', fmt=raw size=8192
$ ./qemu-io -f raw -c 'write -P 42 0 8k' test.raw
wrote 8192/8192 bytes at offset 0
8 KiB, 1 ops; 00.00 sec (64.804 MiB/sec and 8294.9003 ops/sec)
$ hexdump -C test.raw
00000000 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a |****************|
*
00002000
$ storage-daemon/qemu-storage-daemon \
--blockdev file,node-name=test,filename=test.raw \
--export fuse,id=exp,node-name=test,mountpoint=test.raw,writable=true
$ ./qemu-io --image-opts -c 'read -P 23 0 8k' \
driver=file,filename=test.raw,cache.direct=on,aio=io_uring
read 8192/8192 bytes at offset 0
8 KiB, 1 ops; 00.00 sec (58.481 MiB/sec and 7485.5342 ops/sec)
$ ./qemu-io --image-opts -c 'write -P 23 0 8k' \
driver=file,filename=test.raw,cache.direct=on,aio=io_uring
write failed: No space left on device
$ hexdump -C test.raw
00000000 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 |................|
*
00001000 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a |****************|
*
00002000
So short reads already work (because there is code for that), but short
writes incorrectly produce ENOSPC. This patch fixes that by
resubmitting not only the tail of short reads but short writes also.
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
block/io_uring.c | 83 ++++++++++++++++++++++++++--------------------
block/trace-events | 2 +-
2 files changed, 48 insertions(+), 37 deletions(-)
diff --git a/block/io_uring.c b/block/io_uring.c
index cb131d3b8b..61b54647ae 100644
--- a/block/io_uring.c
+++ b/block/io_uring.c
@@ -27,10 +27,10 @@ typedef struct {
BdrvRequestFlags flags;
/*
- * Buffered reads may require resubmission, see
- * luring_resubmit_short_read().
+ * Short reads/writes require resubmission, see
+ * luring_resubmit_short_io().
*/
- int total_read;
+ int total_done;
QEMUIOVector resubmit_qiov;
CqeHandler cqe_handler;
@@ -44,6 +44,10 @@ static void luring_prep_sqe(struct io_uring_sqe *sqe, void *opaque)
int fd = req->fd;
BdrvRequestFlags flags = req->flags;
+ if (req->resubmit_qiov.iov != NULL) {
+ qiov = &req->resubmit_qiov;
+ }
+
switch (req->type) {
case QEMU_AIO_WRITE:
{
@@ -51,7 +55,8 @@ static void luring_prep_sqe(struct io_uring_sqe *sqe, void *opaque)
if (luring_flags != 0 || qiov->niov > 1) {
#ifdef HAVE_IO_URING_PREP_WRITEV2
io_uring_prep_writev2(sqe, fd, qiov->iov,
- qiov->niov, offset, luring_flags);
+ qiov->niov, offset + req->total_done,
+ luring_flags);
#else
/*
* FUA should only be enabled with HAVE_IO_URING_PREP_WRITEV2, see
@@ -59,12 +64,14 @@ static void luring_prep_sqe(struct io_uring_sqe *sqe, void *opaque)
*/
assert(luring_flags == 0);
- io_uring_prep_writev(sqe, fd, qiov->iov, qiov->niov, offset);
+ io_uring_prep_writev(sqe, fd, qiov->iov, qiov->niov,
+ offset + req->total_done);
#endif
} else {
/* The man page says non-vectored is faster than vectored */
struct iovec *iov = qiov->iov;
- io_uring_prep_write(sqe, fd, iov->iov_base, iov->iov_len, offset);
+ io_uring_prep_write(sqe, fd, iov->iov_base, iov->iov_len,
+ offset + req->total_done);
}
break;
}
@@ -73,17 +80,14 @@ static void luring_prep_sqe(struct io_uring_sqe *sqe, void *opaque)
break;
case QEMU_AIO_READ:
{
- if (req->resubmit_qiov.iov != NULL) {
- qiov = &req->resubmit_qiov;
- }
if (qiov->niov > 1) {
io_uring_prep_readv(sqe, fd, qiov->iov, qiov->niov,
- offset + req->total_read);
+ offset + req->total_done);
} else {
/* The man page says non-vectored is faster than vectored */
struct iovec *iov = qiov->iov;
io_uring_prep_read(sqe, fd, iov->iov_base, iov->iov_len,
- offset + req->total_read);
+ offset + req->total_done);
}
break;
}
@@ -98,21 +102,26 @@ static void luring_prep_sqe(struct io_uring_sqe *sqe, void *opaque)
}
/**
- * luring_resubmit_short_read:
+ * luring_resubmit_short_io:
*
- * Short reads are rare but may occur. The remaining read request needs to be
- * resubmitted.
+ * Short reads and writes are rare but may occur. The remaining request needs
+ * to be resubmitted.
+ *
+ * For example, short reads can be reproduced by a FUSE export deliberately
+ * executing short reads. The tail of short writes is generally resubmitted by
+ * io-uring in the kernel, but if that resubmission encounters an I/O error, the
+ * already submitted portion will be returned as a short write.
*/
-static void luring_resubmit_short_read(LuringRequest *req, int nread)
+static void luring_resubmit_short_io(LuringRequest *req, int ndone)
{
QEMUIOVector *resubmit_qiov;
size_t remaining;
- trace_luring_resubmit_short_read(req, nread);
+ trace_luring_resubmit_short_io(req, ndone);
- /* Update read position */
- req->total_read += nread;
- remaining = req->qiov->size - req->total_read;
+ /* Update I/O position */
+ req->total_done += ndone;
+ remaining = req->qiov->size - req->total_done;
/* Shorten qiov */
resubmit_qiov = &req->resubmit_qiov;
@@ -121,7 +130,7 @@ static void luring_resubmit_short_read(LuringRequest *req, int nread)
} else {
qemu_iovec_reset(resubmit_qiov);
}
- qemu_iovec_concat(resubmit_qiov, req->qiov, req->total_read, remaining);
+ qemu_iovec_concat(resubmit_qiov, req->qiov, req->total_done, remaining);
aio_add_sqe(luring_prep_sqe, req, &req->cqe_handler);
}
@@ -153,26 +162,28 @@ static void luring_cqe_handler(CqeHandler *cqe_handler)
return;
}
} else if (req->qiov) {
- /* total_read is non-zero only for resubmitted read requests */
- int total_bytes = ret + req->total_read;
+ /* total_done is non-zero only for resubmitted requests */
+ int total_bytes = ret + req->total_done;
if (total_bytes == req->qiov->size) {
ret = 0;
- } else {
+ } else if (ret > 0 && (req->type == QEMU_AIO_READ ||
+ req->type == QEMU_AIO_WRITE)) {
/* Short Read/Write */
- if (req->type == QEMU_AIO_READ) {
- if (ret > 0) {
- luring_resubmit_short_read(req, ret);
- return;
- }
-
- /* Pad with zeroes */
- qemu_iovec_memset(req->qiov, total_bytes, 0,
- req->qiov->size - total_bytes);
- ret = 0;
- } else {
- ret = -ENOSPC;
- }
+ luring_resubmit_short_io(req, ret);
+ return;
+ } else if (req->type == QEMU_AIO_READ) {
+ /* Read ret == 0: EOF, pad with zeroes */
+ qemu_iovec_memset(req->qiov, total_bytes, 0,
+ req->qiov->size - total_bytes);
+ ret = 0;
+ } else {
+ /*
+ * Normal write ret == 0 means ENOSPC.
+ * For zone-append, we treat any 0 <= ret < qiov->size as ENOSPC,
+ * too, because resubmitting the tail seems a little unsafe.
+ */
+ ret = -ENOSPC;
}
}
diff --git a/block/trace-events b/block/trace-events
index d170fc96f1..950c82d4b8 100644
--- a/block/trace-events
+++ b/block/trace-events
@@ -64,7 +64,7 @@ file_paio_submit(void *acb, void *opaque, int64_t offset, int count, int type) "
# io_uring.c
luring_cqe_handler(void *req, int ret) "req %p ret %d"
luring_co_submit(void *bs, void *req, int fd, uint64_t offset, size_t nbytes, int type) "bs %p req %p fd %d offset %" PRId64 " nbytes %zd type %d"
-luring_resubmit_short_read(void *req, int nread) "req %p nread %d"
+luring_resubmit_short_io(void *req, int ndone) "req %p ndone %d"
# qcow2.c
qcow2_add_task(void *co, void *bs, void *pool, const char *action, int cluster_type, uint64_t host_offset, uint64_t offset, uint64_t bytes, void *qiov, size_t qiov_offset) "co %p bs %p pool %p: %s: cluster_type %d file_cluster_offset %" PRIu64 " offset %" PRIu64 " bytes %" PRIu64 " qiov %p qiov_offset %zu"
--
2.53.0
^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [PATCH for-11.0 0/3] linux-aio/io-uring: Resubmit tails of short requests
2026-03-18 15:32 [PATCH for-11.0 0/3] linux-aio/io-uring: Resubmit tails of short requests Hanna Czenczek
` (2 preceding siblings ...)
2026-03-18 15:32 ` [PATCH for-11.0 3/3] io-uring: Resubmit tails of short writes Hanna Czenczek
@ 2026-03-23 16:28 ` Kevin Wolf
2026-03-23 16:59 ` Hanna Czenczek
3 siblings, 1 reply; 16+ messages in thread
From: Kevin Wolf @ 2026-03-23 16:28 UTC (permalink / raw)
To: Hanna Czenczek
Cc: qemu-block, qemu-devel, Aarushi Mehta, Stefan Hajnoczi,
Stefano Garzarella
Am 18.03.2026 um 16:32 hat Hanna Czenczek geschrieben:
> Hi,
>
> Short reads and writes can happen. One way to reproduce them is via
> FUSE export, if you force it to limit the request length in the
> read/write path (patch in the commit messages of patches 2 and 3), but
> specifically short writes apparently can also happen with NFS.
>
> For the file-posix block driver, aio=threads already takes care of them.
> aio=native does not, at all, and aio=io_uring only handles short reads,
> but not writes. This series has both aio=native and aio=io_uring handle
> both short reads and writes. zone-append is not touched, as I don’t
> believe resubmitting the tail (if a short append can even happen) is
> safe.
Originally, I think you had intended to loop in coroutine code. Now it
looks like you decided to resubmit already on the lower level directly
when processing completions. Obviously, both approaches work, but I'm
curious what are the reasons that made you pick this approach after all.
Kevin
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH for-11.0 1/3] linux-aio: Put all parameters into qemu_laiocb
2026-03-18 15:32 ` [PATCH for-11.0 1/3] linux-aio: Put all parameters into qemu_laiocb Hanna Czenczek
@ 2026-03-23 16:36 ` Kevin Wolf
2026-03-23 17:02 ` Hanna Czenczek
0 siblings, 1 reply; 16+ messages in thread
From: Kevin Wolf @ 2026-03-23 16:36 UTC (permalink / raw)
To: Hanna Czenczek
Cc: qemu-block, qemu-devel, Aarushi Mehta, Stefan Hajnoczi,
Stefano Garzarella
Am 18.03.2026 um 16:32 hat Hanna Czenczek geschrieben:
> Put all request parameters into the qemu_laiocb struct, which will allow
> re-submitting the tail of short reads/writes.
>
> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> ---
> block/linux-aio.c | 35 ++++++++++++++++++++++-------------
> 1 file changed, 22 insertions(+), 13 deletions(-)
>
> diff --git a/block/linux-aio.c b/block/linux-aio.c
> index 53c3e9af8a..1f25339dc9 100644
> --- a/block/linux-aio.c
> +++ b/block/linux-aio.c
> @@ -40,10 +40,15 @@ struct qemu_laiocb {
> Coroutine *co;
> LinuxAioState *ctx;
> struct iocb iocb;
> + int fd;
> ssize_t ret;
> + off_t offset;
> size_t nbytes;
> QEMUIOVector *qiov;
> - bool is_read;
> +
> + int type;
If you put fd and type next to each other, you'd avoid a hole in the
struct.
> + BdrvRequestFlags flags;
> + uint64_t dev_max_batch;
> QSIMPLEQ_ENTRY(qemu_laiocb) next;
> };
>
> @@ -87,7 +92,7 @@ static void qemu_laio_process_completion(struct qemu_laiocb *laiocb)
> ret = 0;
> } else if (ret >= 0) {
> /* Short reads mean EOF, pad with zeros. */
> - if (laiocb->is_read) {
> + if (laiocb->type == QEMU_AIO_READ) {
> qemu_iovec_memset(laiocb->qiov, ret, 0,
> laiocb->qiov->size - ret);
> } else {
> @@ -367,23 +372,23 @@ static void laio_deferred_fn(void *opaque)
> }
> }
>
> -static int laio_do_submit(int fd, struct qemu_laiocb *laiocb, off_t offset,
> - int type, BdrvRequestFlags flags,
> - uint64_t dev_max_batch)
> +static int laio_do_submit(struct qemu_laiocb *laiocb)
> {
> LinuxAioState *s = laiocb->ctx;
> struct iocb *iocbs = &laiocb->iocb;
> QEMUIOVector *qiov = laiocb->qiov;
> + int fd = laiocb->fd;
> + off_t offset = laiocb->offset;
>
> - switch (type) {
> + switch (laiocb->type) {
> case QEMU_AIO_WRITE:
> #ifdef HAVE_IO_PREP_PWRITEV2
> {
> - int laio_flags = (flags & BDRV_REQ_FUA) ? RWF_DSYNC : 0;
> + int laio_flags = (laiocb->flags & BDRV_REQ_FUA) ? RWF_DSYNC : 0;
> io_prep_pwritev2(iocbs, fd, qiov->iov, qiov->niov, offset, laio_flags);
> }
> #else
> - assert(flags == 0);
> + assert(laiocb->flags == 0);
> io_prep_pwritev(iocbs, fd, qiov->iov, qiov->niov, offset);
> #endif
> break;
> @@ -399,7 +404,7 @@ static int laio_do_submit(int fd, struct qemu_laiocb *laiocb, off_t offset,
> /* Currently Linux kernel does not support other operations */
> default:
> fprintf(stderr, "%s: invalid AIO request type 0x%x.\n",
> - __func__, type);
> + __func__, laiocb->type);
> return -EIO;
> }
> io_set_eventfd(&laiocb->iocb, event_notifier_get_fd(&s->e));
> @@ -407,7 +412,7 @@ static int laio_do_submit(int fd, struct qemu_laiocb *laiocb, off_t offset,
> QSIMPLEQ_INSERT_TAIL(&s->io_q.pending, laiocb, next);
> s->io_q.in_queue++;
> if (!s->io_q.blocked) {
> - if (s->io_q.in_queue >= laio_max_batch(s, dev_max_batch)) {
> + if (s->io_q.in_queue >= laio_max_batch(s, laiocb->dev_max_batch)) {
> ioq_submit(s);
> } else {
> defer_call(laio_deferred_fn, s);
> @@ -425,14 +430,18 @@ int coroutine_fn laio_co_submit(int fd, uint64_t offset, QEMUIOVector *qiov,
> AioContext *ctx = qemu_get_current_aio_context();
> struct qemu_laiocb laiocb = {
> .co = qemu_coroutine_self(),
> - .nbytes = qiov ? qiov->size : 0,
> + .fd = fd,
> + .offset = offset,
> + .nbytes = (qiov ? qiov->size : 0),
Adding parentheses around the expression seems both unrelated and
unnecessary.
> .ctx = aio_get_linux_aio(ctx),
> .ret = -EINPROGRESS,
> - .is_read = (type == QEMU_AIO_READ),
> .qiov = qiov,
> + .type = type,
> + .flags = flags,
> + .dev_max_batch = dev_max_batch,
> };
>
> - ret = laio_do_submit(fd, &laiocb, offset, type, flags, dev_max_batch);
> + ret = laio_do_submit(&laiocb);
> if (ret < 0) {
> return ret;
> }
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH for-11.0 0/3] linux-aio/io-uring: Resubmit tails of short requests
2026-03-23 16:28 ` [PATCH for-11.0 0/3] linux-aio/io-uring: Resubmit tails of short requests Kevin Wolf
@ 2026-03-23 16:59 ` Hanna Czenczek
0 siblings, 0 replies; 16+ messages in thread
From: Hanna Czenczek @ 2026-03-23 16:59 UTC (permalink / raw)
To: Kevin Wolf
Cc: qemu-block, qemu-devel, Aarushi Mehta, Stefan Hajnoczi,
Stefano Garzarella
On 23.03.26 17:28, Kevin Wolf wrote:
> Am 18.03.2026 um 16:32 hat Hanna Czenczek geschrieben:
>> Hi,
>>
>> Short reads and writes can happen. One way to reproduce them is via
>> FUSE export, if you force it to limit the request length in the
>> read/write path (patch in the commit messages of patches 2 and 3), but
>> specifically short writes apparently can also happen with NFS.
>>
>> For the file-posix block driver, aio=threads already takes care of them.
>> aio=native does not, at all, and aio=io_uring only handles short reads,
>> but not writes. This series has both aio=native and aio=io_uring handle
>> both short reads and writes. zone-append is not touched, as I don’t
>> believe resubmitting the tail (if a short append can even happen) is
>> safe.
> Originally, I think you had intended to loop in coroutine code. Now it
> looks like you decided to resubmit already on the lower level directly
> when processing completions. Obviously, both approaches work, but I'm
> curious what are the reasons that made you pick this approach after all.
io-uring did it this way, so I thought for a series for the freeze
period, it would be safest to just copy what it does.
Hanna
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH for-11.0 1/3] linux-aio: Put all parameters into qemu_laiocb
2026-03-23 16:36 ` Kevin Wolf
@ 2026-03-23 17:02 ` Hanna Czenczek
2026-03-23 17:04 ` Hanna Czenczek
0 siblings, 1 reply; 16+ messages in thread
From: Hanna Czenczek @ 2026-03-23 17:02 UTC (permalink / raw)
To: Kevin Wolf
Cc: qemu-block, qemu-devel, Aarushi Mehta, Stefan Hajnoczi,
Stefano Garzarella
On 23.03.26 17:36, Kevin Wolf wrote:
> Am 18.03.2026 um 16:32 hat Hanna Czenczek geschrieben:
>> Put all request parameters into the qemu_laiocb struct, which will allow
>> re-submitting the tail of short reads/writes.
>>
>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
>> ---
>> block/linux-aio.c | 35 ++++++++++++++++++++++-------------
>> 1 file changed, 22 insertions(+), 13 deletions(-)
>>
>> diff --git a/block/linux-aio.c b/block/linux-aio.c
>> index 53c3e9af8a..1f25339dc9 100644
>> --- a/block/linux-aio.c
>> +++ b/block/linux-aio.c
>> @@ -40,10 +40,15 @@ struct qemu_laiocb {
>> Coroutine *co;
>> LinuxAioState *ctx;
>> struct iocb iocb;
>> + int fd;
>> ssize_t ret;
>> + off_t offset;
>> size_t nbytes;
>> QEMUIOVector *qiov;
>> - bool is_read;
>> +
>> + int type;
> If you put fd and type next to each other, you'd avoid a hole in the
> struct.
And I assume pulling dev_max_batch forward, after qiov, yes.
>> + BdrvRequestFlags flags;
>> + uint64_t dev_max_batch;
>> QSIMPLEQ_ENTRY(qemu_laiocb) next;
>> };
>>
>> @@ -87,7 +92,7 @@ static void qemu_laio_process_completion(struct qemu_laiocb *laiocb)
>> ret = 0;
>> } else if (ret >= 0) {
>> /* Short reads mean EOF, pad with zeros. */
>> - if (laiocb->is_read) {
>> + if (laiocb->type == QEMU_AIO_READ) {
>> qemu_iovec_memset(laiocb->qiov, ret, 0,
>> laiocb->qiov->size - ret);
>> } else {
>> @@ -367,23 +372,23 @@ static void laio_deferred_fn(void *opaque)
>> }
>> }
>>
>> -static int laio_do_submit(int fd, struct qemu_laiocb *laiocb, off_t offset,
>> - int type, BdrvRequestFlags flags,
>> - uint64_t dev_max_batch)
>> +static int laio_do_submit(struct qemu_laiocb *laiocb)
>> {
>> LinuxAioState *s = laiocb->ctx;
>> struct iocb *iocbs = &laiocb->iocb;
>> QEMUIOVector *qiov = laiocb->qiov;
>> + int fd = laiocb->fd;
>> + off_t offset = laiocb->offset;
>>
>> - switch (type) {
>> + switch (laiocb->type) {
>> case QEMU_AIO_WRITE:
>> #ifdef HAVE_IO_PREP_PWRITEV2
>> {
>> - int laio_flags = (flags & BDRV_REQ_FUA) ? RWF_DSYNC : 0;
>> + int laio_flags = (laiocb->flags & BDRV_REQ_FUA) ? RWF_DSYNC : 0;
>> io_prep_pwritev2(iocbs, fd, qiov->iov, qiov->niov, offset, laio_flags);
>> }
>> #else
>> - assert(flags == 0);
>> + assert(laiocb->flags == 0);
>> io_prep_pwritev(iocbs, fd, qiov->iov, qiov->niov, offset);
>> #endif
>> break;
>> @@ -399,7 +404,7 @@ static int laio_do_submit(int fd, struct qemu_laiocb *laiocb, off_t offset,
>> /* Currently Linux kernel does not support other operations */
>> default:
>> fprintf(stderr, "%s: invalid AIO request type 0x%x.\n",
>> - __func__, type);
>> + __func__, laiocb->type);
>> return -EIO;
>> }
>> io_set_eventfd(&laiocb->iocb, event_notifier_get_fd(&s->e));
>> @@ -407,7 +412,7 @@ static int laio_do_submit(int fd, struct qemu_laiocb *laiocb, off_t offset,
>> QSIMPLEQ_INSERT_TAIL(&s->io_q.pending, laiocb, next);
>> s->io_q.in_queue++;
>> if (!s->io_q.blocked) {
>> - if (s->io_q.in_queue >= laio_max_batch(s, dev_max_batch)) {
>> + if (s->io_q.in_queue >= laio_max_batch(s, laiocb->dev_max_batch)) {
>> ioq_submit(s);
>> } else {
>> defer_call(laio_deferred_fn, s);
>> @@ -425,14 +430,18 @@ int coroutine_fn laio_co_submit(int fd, uint64_t offset, QEMUIOVector *qiov,
>> AioContext *ctx = qemu_get_current_aio_context();
>> struct qemu_laiocb laiocb = {
>> .co = qemu_coroutine_self(),
>> - .nbytes = qiov ? qiov->size : 0,
>> + .fd = fd,
>> + .offset = offset,
>> + .nbytes = (qiov ? qiov->size : 0),
> Adding parentheses around the expression seems both unrelated and
> unnecessary.
True. Sorry, artifact from a prior version, I think.
>> .ctx = aio_get_linux_aio(ctx),
>> .ret = -EINPROGRESS,
>> - .is_read = (type == QEMU_AIO_READ),
>> .qiov = qiov,
>> + .type = type,
>> + .flags = flags,
>> + .dev_max_batch = dev_max_batch,
>> };
>>
>> - ret = laio_do_submit(fd, &laiocb, offset, type, flags, dev_max_batch);
>> + ret = laio_do_submit(&laiocb);
>> if (ret < 0) {
>> return ret;
>> }
> Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Thanks!
Hanna
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH for-11.0 1/3] linux-aio: Put all parameters into qemu_laiocb
2026-03-23 17:02 ` Hanna Czenczek
@ 2026-03-23 17:04 ` Hanna Czenczek
2026-03-23 19:10 ` Kevin Wolf
0 siblings, 1 reply; 16+ messages in thread
From: Hanna Czenczek @ 2026-03-23 17:04 UTC (permalink / raw)
To: Kevin Wolf
Cc: qemu-block, qemu-devel, Aarushi Mehta, Stefan Hajnoczi,
Stefano Garzarella
On 23.03.26 18:02, Hanna Czenczek wrote:
> On 23.03.26 17:36, Kevin Wolf wrote:
>> Am 18.03.2026 um 16:32 hat Hanna Czenczek geschrieben:
>>> Put all request parameters into the qemu_laiocb struct, which will
>>> allow
>>> re-submitting the tail of short reads/writes.
>>>
>>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
>>> ---
>>> block/linux-aio.c | 35 ++++++++++++++++++++++-------------
>>> 1 file changed, 22 insertions(+), 13 deletions(-)
>>>
>>> diff --git a/block/linux-aio.c b/block/linux-aio.c
>>> index 53c3e9af8a..1f25339dc9 100644
>>> --- a/block/linux-aio.c
>>> +++ b/block/linux-aio.c
>>> @@ -40,10 +40,15 @@ struct qemu_laiocb {
>>> Coroutine *co;
>>> LinuxAioState *ctx;
>>> struct iocb iocb;
>>> + int fd;
>>> ssize_t ret;
>>> + off_t offset;
>>> size_t nbytes;
>>> QEMUIOVector *qiov;
>>> - bool is_read;
>>> +
>>> + int type;
>> If you put fd and type next to each other, you'd avoid a hole in the
>> struct.
>
> And I assume pulling dev_max_batch forward, after qiov, yes.
Noticed while sending: I missed `.next` for some reason. So no reason to
pull dev_max_batch forward, and I guess there will always be a hole, but
I suppose it is still best to put all 32-bit ints next to each other
(fd, type, flags) so filling that up in the future is more likely.
Hanna
>>> + BdrvRequestFlags flags;
>>> + uint64_t dev_max_batch;
>>> QSIMPLEQ_ENTRY(qemu_laiocb) next;
>>> };
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH for-11.0 2/3] linux-aio: Resubmit tails of short reads/writes
2026-03-18 15:32 ` [PATCH for-11.0 2/3] linux-aio: Resubmit tails of short reads/writes Hanna Czenczek
@ 2026-03-23 17:12 ` Kevin Wolf
2026-03-24 8:12 ` Hanna Czenczek
0 siblings, 1 reply; 16+ messages in thread
From: Kevin Wolf @ 2026-03-23 17:12 UTC (permalink / raw)
To: Hanna Czenczek
Cc: qemu-block, qemu-devel, Aarushi Mehta, Stefan Hajnoczi,
Stefano Garzarella
Am 18.03.2026 um 16:32 hat Hanna Czenczek geschrieben:
> Short reads/writes can happen. One way to reproduce them is via our
> FUSE export, with the following diff applied (%s/escaped // to apply --
> if you put plain diffs in commit messages, git-am will apply them, and I
> would rather avoid breaking FUSE accidentally via this patch):
>
> escaped diff --git a/block/export/fuse.c b/block/export/fuse.c
> escaped index a2a478d293..67dc50a412 100644
> escaped --- a/block/export/fuse.c
> escaped +++ b/block/export/fuse.c
> @@ -828,7 +828,7 @@ static ssize_t coroutine_fn GRAPH_RDLOCK
> fuse_co_init(FuseExport *exp, struct fuse_init_out *out,
> const struct fuse_init_in_compat *in)
> {
> - const uint32_t supported_flags = FUSE_ASYNC_READ | FUSE_ASYNC_DIO;
> + const uint32_t supported_flags = FUSE_ASYNC_READ;
>
> if (in->major != 7) {
> error_report("FUSE major version mismatch: We have 7, but kernel has %"
> @@ -1060,6 +1060,8 @@ fuse_co_read(FuseExport *exp, void **bufptr, uint64_t offset, uint32_t size)
> void *buf;
> int ret;
>
> + size = MIN(size, 4096);
> +
> /* Limited by max_read, should not happen */
> if (size > FUSE_MAX_READ_BYTES) {
> return -EINVAL;
> @@ -1110,6 +1112,8 @@ fuse_co_write(FuseExport *exp, struct fuse_write_out *out,
> int64_t blk_len;
> int ret;
>
> + size = MIN(size, 4096);
> +
> QEMU_BUILD_BUG_ON(FUSE_MAX_WRITE_BYTES > BDRV_REQUEST_MAX_BYTES);
> /* Limited by max_write, should not happen */
> if (size > FUSE_MAX_WRITE_BYTES) {
>
> Then:
> $ ./qemu-img create -f raw test.raw 8k
> Formatting 'test.raw', fmt=raw size=8192
> $ ./qemu-io -f raw -c 'write -P 42 0 8k' test.raw
> wrote 8192/8192 bytes at offset 0
> 8 KiB, 1 ops; 00.00 sec (64.804 MiB/sec and 8294.9003 ops/sec)
> $ hexdump -C test.raw
> 00000000 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a |****************|
> *
> 00002000
>
> With aio=threads, short I/O works:
> $ storage-daemon/qemu-storage-daemon \
> --blockdev file,node-name=test,filename=test.raw \
> --export fuse,id=exp,node-name=test,mountpoint=test.raw,writable=true
>
> Other shell:
> $ ./qemu-io --image-opts -c 'read -P 42 0 8k' \
> driver=file,filename=test.raw,cache.direct=on,aio=threads
> read 8192/8192 bytes at offset 0
> 8 KiB, 1 ops; 00.00 sec (36.563 MiB/sec and 4680.0923 ops/sec)
> $ ./qemu-io --image-opts -c 'write -P 23 0 8k' \
> driver=file,filename=test.raw,cache.direct=on,aio=threads
> wrote 8192/8192 bytes at offset 0
> 8 KiB, 1 ops; 00.00 sec (35.995 MiB/sec and 4607.2970 ops/sec)
> $ hexdump -C test.raw
> 00000000 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 |................|
> *
> 00002000
>
> But with aio=native, it does not:
> $ ./qemu-io --image-opts -c 'read -P 23 0 8k' \
> driver=file,filename=test.raw,cache.direct=on,aio=native
> Pattern verification failed at offset 0, 8192 bytes
> read 8192/8192 bytes at offset 0
> 8 KiB, 1 ops; 00.00 sec (86.155 MiB/sec and 11027.7900 ops/sec)
> $ ./qemu-io --image-opts -c 'write -P 42 0 8k' \
> driver=file,filename=test.raw,cache.direct=on,aio=native
> write failed: No space left on device
> $ hexdump -C test.raw
> 00000000 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a |****************|
> *
> 00001000 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 |................|
> *
> 00002000
>
> This patch fixes that.
>
> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> ---
> block/linux-aio.c | 61 ++++++++++++++++++++++++++++++++++++++++-------
> 1 file changed, 53 insertions(+), 8 deletions(-)
>
> diff --git a/block/linux-aio.c b/block/linux-aio.c
> index 1f25339dc9..01621d4794 100644
> --- a/block/linux-aio.c
> +++ b/block/linux-aio.c
> @@ -46,6 +46,10 @@ struct qemu_laiocb {
> size_t nbytes;
> QEMUIOVector *qiov;
>
> + /* For handling short reads/writes */
> + size_t total_done;
> + QEMUIOVector resubmit_qiov;
> +
> int type;
> BdrvRequestFlags flags;
> uint64_t dev_max_batch;
> @@ -73,28 +77,61 @@ struct LinuxAioState {
> };
>
> static void ioq_submit(LinuxAioState *s);
> +static int laio_do_submit(struct qemu_laiocb *laiocb);
>
> static inline ssize_t io_event_ret(struct io_event *ev)
> {
> return (ssize_t)(((uint64_t)ev->res2 << 32) | ev->res);
> }
>
> +/**
> + * Retry tail of short requests.
> + */
> +static int laio_resubmit_short_io(struct qemu_laiocb *laiocb, size_t done)
> +{
> + QEMUIOVector *resubmit_qiov = &laiocb->resubmit_qiov;
> +
> + laiocb->total_done += done;
> +
> + if (!resubmit_qiov->iov) {
> + qemu_iovec_init(resubmit_qiov, laiocb->qiov->niov);
> + } else {
> + qemu_iovec_reset(resubmit_qiov);
> + }
> + qemu_iovec_concat(resubmit_qiov, laiocb->qiov,
> + laiocb->total_done, laiocb->nbytes - laiocb->total_done);
> +
> + return laio_do_submit(laiocb);
> +}
> +
> /*
> * Completes an AIO request.
> */
> static void qemu_laio_process_completion(struct qemu_laiocb *laiocb)
> {
> - int ret;
> + ssize_t ret;
>
> ret = laiocb->ret;
> if (ret != -ECANCELED) {
> - if (ret == laiocb->nbytes) {
> + if (ret == laiocb->nbytes - laiocb->total_done) {
> ret = 0;
> + } else if (ret > 0 && (laiocb->type == QEMU_AIO_READ ||
> + laiocb->type == QEMU_AIO_WRITE)) {
> + ret = laio_resubmit_short_io(laiocb, ret);
> + if (!ret) {
> + return;
> + }
> } else if (ret >= 0) {
> - /* Short reads mean EOF, pad with zeros. */
> + /*
> + * For normal reads and writes, we only get here if ret == 0, which
> + * means EOF for reads and ENOSPC for writes.
> + * For zone-append, we get here with any ret >= 0, which we just
> + * treat as ENOSPC, too (safer than resubmitting, probably, but not
> + * 100 % clear).
> + */
> if (laiocb->type == QEMU_AIO_READ) {
> - qemu_iovec_memset(laiocb->qiov, ret, 0,
> - laiocb->qiov->size - ret);
> + qemu_iovec_memset(laiocb->qiov, laiocb->total_done, 0,
> + laiocb->qiov->size - laiocb->total_done);
> } else {
> ret = -ENOSPC;
> }
> @@ -102,6 +139,7 @@ static void qemu_laio_process_completion(struct qemu_laiocb *laiocb)
> }
>
> laiocb->ret = ret;
> + qemu_iovec_destroy(&laiocb->resubmit_qiov);
Calling qemu_iovec_destroy() for a qiov that has potentially never been
initialised feels a bit unsafe to me. It will work in practice for the
current implementation, but maybe make this one conditional on
laiocb->resubmit_qiov.iov, too? (Which is already making assumptions
about the internals of QEMUIOVector, but that we'll have an iov after
initialising the qiov with qemu_iovec_init() above will probably never
change.)
> /*
> * If the coroutine is already entered it must be in ioq_submit() and
> @@ -380,23 +418,30 @@ static int laio_do_submit(struct qemu_laiocb *laiocb)
> int fd = laiocb->fd;
> off_t offset = laiocb->offset;
I wonder if making it laiocb->offset + laiocb->total_done here wouldn't
be more robust than having the addition in every call below.
> + if (laiocb->resubmit_qiov.iov) {
> + qiov = &laiocb->resubmit_qiov;
> + }
> +
> switch (laiocb->type) {
> case QEMU_AIO_WRITE:
> #ifdef HAVE_IO_PREP_PWRITEV2
> {
> int laio_flags = (laiocb->flags & BDRV_REQ_FUA) ? RWF_DSYNC : 0;
> - io_prep_pwritev2(iocbs, fd, qiov->iov, qiov->niov, offset, laio_flags);
> + io_prep_pwritev2(iocbs, fd, qiov->iov, qiov->niov,
> + offset + laiocb->total_done, laio_flags);
> }
> #else
> assert(laiocb->flags == 0);
> - io_prep_pwritev(iocbs, fd, qiov->iov, qiov->niov, offset);
> + io_prep_pwritev(iocbs, fd, qiov->iov, qiov->niov,
> + offset + laiocb->total_done);
> #endif
> break;
> case QEMU_AIO_ZONE_APPEND:
> io_prep_pwritev(iocbs, fd, qiov->iov, qiov->niov, offset);
> break;
> case QEMU_AIO_READ:
> - io_prep_preadv(iocbs, fd, qiov->iov, qiov->niov, offset);
> + io_prep_preadv(iocbs, fd, qiov->iov, qiov->niov,
> + offset + laiocb->total_done);
> break;
> case QEMU_AIO_FLUSH:
> io_prep_fdsync(iocbs, fd);
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH for-11.0 3/3] io-uring: Resubmit tails of short writes
2026-03-18 15:32 ` [PATCH for-11.0 3/3] io-uring: Resubmit tails of short writes Hanna Czenczek
@ 2026-03-23 19:05 ` Kevin Wolf
0 siblings, 0 replies; 16+ messages in thread
From: Kevin Wolf @ 2026-03-23 19:05 UTC (permalink / raw)
To: Hanna Czenczek
Cc: qemu-block, qemu-devel, Aarushi Mehta, Stefan Hajnoczi,
Stefano Garzarella
Am 18.03.2026 um 16:32 hat Hanna Czenczek geschrieben:
> Short writes can happen, too, not just short reads. The difference to
> aio=native is that the kernel will actually retry the tail of short
> requests internally already -- so it is harder to reproduce. But if the
> tail of a short request returns an error to the kernel, we will see it
> in userspace still. To reproduce this, apply the following patch on top
> of the one shown in HEAD^ (again %s/escaped // to apply):
>
> escaped diff --git a/block/export/fuse.c b/block/export/fuse.c
> escaped index 67dc50a412..2b98489a32 100644
> escaped --- a/block/export/fuse.c
> escaped +++ b/block/export/fuse.c
> @@ -1059,8 +1059,15 @@ fuse_co_read(FuseExport *exp, void **bufptr, uint64_t offset, uint32_t size)
> int64_t blk_len;
> void *buf;
> int ret;
> + static uint32_t error_size;
>
> - size = MIN(size, 4096);
> + if (error_size == size) {
> + error_size = 0;
> + return -EIO;
> + } else if (size > 4096) {
> + error_size = size - 4096;
> + size = 4096;
> + }
>
> /* Limited by max_read, should not happen */
> if (size > FUSE_MAX_READ_BYTES) {
> @@ -1111,8 +1118,15 @@ fuse_co_write(FuseExport *exp, struct fuse_write_out *out,
> {
> int64_t blk_len;
> int ret;
> + static uint32_t error_size;
>
> - size = MIN(size, 4096);
> + if (error_size == size) {
> + error_size = 0;
> + return -EIO;
> + } else if (size > 4096) {
> + error_size = size - 4096;
> + size = 4096;
> + }
>
> QEMU_BUILD_BUG_ON(FUSE_MAX_WRITE_BYTES > BDRV_REQUEST_MAX_BYTES);
> /* Limited by max_write, should not happen */
>
> I know this is a bit artificial because to produce this, there must be
> an I/O error somewhere anyway, but if it does happen, qemu will
> understand it to mean ENOSPC for short writes, which is incorrect. So I
> believe we need to resubmit the tail to maybe have it succeed now, or at
> least get the correct error code.
>
> Reproducer as before:
> $ ./qemu-img create -f raw test.raw 8k
> Formatting 'test.raw', fmt=raw size=8192
> $ ./qemu-io -f raw -c 'write -P 42 0 8k' test.raw
> wrote 8192/8192 bytes at offset 0
> 8 KiB, 1 ops; 00.00 sec (64.804 MiB/sec and 8294.9003 ops/sec)
> $ hexdump -C test.raw
> 00000000 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a |****************|
> *
> 00002000
> $ storage-daemon/qemu-storage-daemon \
> --blockdev file,node-name=test,filename=test.raw \
> --export fuse,id=exp,node-name=test,mountpoint=test.raw,writable=true
>
> $ ./qemu-io --image-opts -c 'read -P 23 0 8k' \
> driver=file,filename=test.raw,cache.direct=on,aio=io_uring
> read 8192/8192 bytes at offset 0
> 8 KiB, 1 ops; 00.00 sec (58.481 MiB/sec and 7485.5342 ops/sec)
> $ ./qemu-io --image-opts -c 'write -P 23 0 8k' \
> driver=file,filename=test.raw,cache.direct=on,aio=io_uring
> write failed: No space left on device
> $ hexdump -C test.raw
> 00000000 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 |................|
> *
> 00001000 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a |****************|
> *
> 00002000
>
> So short reads already work (because there is code for that), but short
> writes incorrectly produce ENOSPC. This patch fixes that by
> resubmitting not only the tail of short reads but short writes also.
>
> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> @@ -44,6 +44,10 @@ static void luring_prep_sqe(struct io_uring_sqe *sqe, void *opaque)
uint64_t offset = req->offset;
> int fd = req->fd;
> BdrvRequestFlags flags = req->flags;
>
> + if (req->resubmit_qiov.iov != NULL) {
> + qiov = &req->resubmit_qiov;
> + }
> +
We could have offset = req->offset + req->total_done again instead of
adding them in each case below, like I already commented on linux-aio.c.
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH for-11.0 1/3] linux-aio: Put all parameters into qemu_laiocb
2026-03-23 17:04 ` Hanna Czenczek
@ 2026-03-23 19:10 ` Kevin Wolf
0 siblings, 0 replies; 16+ messages in thread
From: Kevin Wolf @ 2026-03-23 19:10 UTC (permalink / raw)
To: Hanna Czenczek
Cc: qemu-block, qemu-devel, Aarushi Mehta, Stefan Hajnoczi,
Stefano Garzarella
Am 23.03.2026 um 18:04 hat Hanna Czenczek geschrieben:
> On 23.03.26 18:02, Hanna Czenczek wrote:
> > On 23.03.26 17:36, Kevin Wolf wrote:
> > > Am 18.03.2026 um 16:32 hat Hanna Czenczek geschrieben:
> > > > Put all request parameters into the qemu_laiocb struct, which
> > > > will allow
> > > > re-submitting the tail of short reads/writes.
> > > >
> > > > Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> > > > ---
> > > > block/linux-aio.c | 35 ++++++++++++++++++++++-------------
> > > > 1 file changed, 22 insertions(+), 13 deletions(-)
> > > >
> > > > diff --git a/block/linux-aio.c b/block/linux-aio.c
> > > > index 53c3e9af8a..1f25339dc9 100644
> > > > --- a/block/linux-aio.c
> > > > +++ b/block/linux-aio.c
> > > > @@ -40,10 +40,15 @@ struct qemu_laiocb {
> > > > Coroutine *co;
> > > > LinuxAioState *ctx;
> > > > struct iocb iocb;
> > > > + int fd;
> > > > ssize_t ret;
> > > > + off_t offset;
> > > > size_t nbytes;
> > > > QEMUIOVector *qiov;
> > > > - bool is_read;
> > > > +
> > > > + int type;
> > > If you put fd and type next to each other, you'd avoid a hole in the
> > > struct.
> >
> > And I assume pulling dev_max_batch forward, after qiov, yes.
>
> Noticed while sending: I missed `.next` for some reason. So no reason to
> pull dev_max_batch forward, and I guess there will always be a hole, but I
> suppose it is still best to put all 32-bit ints next to each other (fd,
> type, flags) so filling that up in the future is more likely.
Ah, yes, I missed that flags is only 32 bits, too.
I'm sure that things could still be rearranged to be more optimal, and
I'm also sure it's premature optimisation. Sorry for the noise.
Kevin
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH for-11.0 2/3] linux-aio: Resubmit tails of short reads/writes
2026-03-23 17:12 ` Kevin Wolf
@ 2026-03-24 8:12 ` Hanna Czenczek
2026-03-24 8:22 ` Hanna Czenczek
0 siblings, 1 reply; 16+ messages in thread
From: Hanna Czenczek @ 2026-03-24 8:12 UTC (permalink / raw)
To: Kevin Wolf
Cc: qemu-block, qemu-devel, Aarushi Mehta, Stefan Hajnoczi,
Stefano Garzarella
On 23.03.26 18:12, Kevin Wolf wrote:
> Am 18.03.2026 um 16:32 hat Hanna Czenczek geschrieben:
>> Short reads/writes can happen. One way to reproduce them is via our
>> FUSE export, with the following diff applied (%s/escaped // to apply --
>> if you put plain diffs in commit messages, git-am will apply them, and I
>> would rather avoid breaking FUSE accidentally via this patch):
>>
>> escaped diff --git a/block/export/fuse.c b/block/export/fuse.c
>> escaped index a2a478d293..67dc50a412 100644
>> escaped --- a/block/export/fuse.c
>> escaped +++ b/block/export/fuse.c
>> @@ -828,7 +828,7 @@ static ssize_t coroutine_fn GRAPH_RDLOCK
>> fuse_co_init(FuseExport *exp, struct fuse_init_out *out,
>> const struct fuse_init_in_compat *in)
>> {
>> - const uint32_t supported_flags = FUSE_ASYNC_READ | FUSE_ASYNC_DIO;
>> + const uint32_t supported_flags = FUSE_ASYNC_READ;
>>
>> if (in->major != 7) {
>> error_report("FUSE major version mismatch: We have 7, but kernel has %"
>> @@ -1060,6 +1060,8 @@ fuse_co_read(FuseExport *exp, void **bufptr, uint64_t offset, uint32_t size)
>> void *buf;
>> int ret;
>>
>> + size = MIN(size, 4096);
>> +
>> /* Limited by max_read, should not happen */
>> if (size > FUSE_MAX_READ_BYTES) {
>> return -EINVAL;
>> @@ -1110,6 +1112,8 @@ fuse_co_write(FuseExport *exp, struct fuse_write_out *out,
>> int64_t blk_len;
>> int ret;
>>
>> + size = MIN(size, 4096);
>> +
>> QEMU_BUILD_BUG_ON(FUSE_MAX_WRITE_BYTES > BDRV_REQUEST_MAX_BYTES);
>> /* Limited by max_write, should not happen */
>> if (size > FUSE_MAX_WRITE_BYTES) {
>>
>> Then:
>> $ ./qemu-img create -f raw test.raw 8k
>> Formatting 'test.raw', fmt=raw size=8192
>> $ ./qemu-io -f raw -c 'write -P 42 0 8k' test.raw
>> wrote 8192/8192 bytes at offset 0
>> 8 KiB, 1 ops; 00.00 sec (64.804 MiB/sec and 8294.9003 ops/sec)
>> $ hexdump -C test.raw
>> 00000000 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a |****************|
>> *
>> 00002000
>>
>> With aio=threads, short I/O works:
>> $ storage-daemon/qemu-storage-daemon \
>> --blockdev file,node-name=test,filename=test.raw \
>> --export fuse,id=exp,node-name=test,mountpoint=test.raw,writable=true
>>
>> Other shell:
>> $ ./qemu-io --image-opts -c 'read -P 42 0 8k' \
>> driver=file,filename=test.raw,cache.direct=on,aio=threads
>> read 8192/8192 bytes at offset 0
>> 8 KiB, 1 ops; 00.00 sec (36.563 MiB/sec and 4680.0923 ops/sec)
>> $ ./qemu-io --image-opts -c 'write -P 23 0 8k' \
>> driver=file,filename=test.raw,cache.direct=on,aio=threads
>> wrote 8192/8192 bytes at offset 0
>> 8 KiB, 1 ops; 00.00 sec (35.995 MiB/sec and 4607.2970 ops/sec)
>> $ hexdump -C test.raw
>> 00000000 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 |................|
>> *
>> 00002000
>>
>> But with aio=native, it does not:
>> $ ./qemu-io --image-opts -c 'read -P 23 0 8k' \
>> driver=file,filename=test.raw,cache.direct=on,aio=native
>> Pattern verification failed at offset 0, 8192 bytes
>> read 8192/8192 bytes at offset 0
>> 8 KiB, 1 ops; 00.00 sec (86.155 MiB/sec and 11027.7900 ops/sec)
>> $ ./qemu-io --image-opts -c 'write -P 42 0 8k' \
>> driver=file,filename=test.raw,cache.direct=on,aio=native
>> write failed: No space left on device
>> $ hexdump -C test.raw
>> 00000000 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a |****************|
>> *
>> 00001000 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 |................|
>> *
>> 00002000
>>
>> This patch fixes that.
>>
>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
>> ---
>> block/linux-aio.c | 61 ++++++++++++++++++++++++++++++++++++++++-------
>> 1 file changed, 53 insertions(+), 8 deletions(-)
>>
>> diff --git a/block/linux-aio.c b/block/linux-aio.c
>> index 1f25339dc9..01621d4794 100644
>> --- a/block/linux-aio.c
>> +++ b/block/linux-aio.c
>> @@ -46,6 +46,10 @@ struct qemu_laiocb {
>> size_t nbytes;
>> QEMUIOVector *qiov;
>>
>> + /* For handling short reads/writes */
>> + size_t total_done;
>> + QEMUIOVector resubmit_qiov;
>> +
>> int type;
>> BdrvRequestFlags flags;
>> uint64_t dev_max_batch;
>> @@ -73,28 +77,61 @@ struct LinuxAioState {
>> };
>>
>> static void ioq_submit(LinuxAioState *s);
>> +static int laio_do_submit(struct qemu_laiocb *laiocb);
>>
>> static inline ssize_t io_event_ret(struct io_event *ev)
>> {
>> return (ssize_t)(((uint64_t)ev->res2 << 32) | ev->res);
>> }
>>
>> +/**
>> + * Retry tail of short requests.
>> + */
>> +static int laio_resubmit_short_io(struct qemu_laiocb *laiocb, size_t done)
>> +{
>> + QEMUIOVector *resubmit_qiov = &laiocb->resubmit_qiov;
>> +
>> + laiocb->total_done += done;
>> +
>> + if (!resubmit_qiov->iov) {
>> + qemu_iovec_init(resubmit_qiov, laiocb->qiov->niov);
>> + } else {
>> + qemu_iovec_reset(resubmit_qiov);
>> + }
>> + qemu_iovec_concat(resubmit_qiov, laiocb->qiov,
>> + laiocb->total_done, laiocb->nbytes - laiocb->total_done);
>> +
>> + return laio_do_submit(laiocb);
>> +}
>> +
>> /*
>> * Completes an AIO request.
>> */
>> static void qemu_laio_process_completion(struct qemu_laiocb *laiocb)
>> {
>> - int ret;
>> + ssize_t ret;
>>
>> ret = laiocb->ret;
>> if (ret != -ECANCELED) {
>> - if (ret == laiocb->nbytes) {
>> + if (ret == laiocb->nbytes - laiocb->total_done) {
>> ret = 0;
>> + } else if (ret > 0 && (laiocb->type == QEMU_AIO_READ ||
>> + laiocb->type == QEMU_AIO_WRITE)) {
>> + ret = laio_resubmit_short_io(laiocb, ret);
>> + if (!ret) {
>> + return;
>> + }
>> } else if (ret >= 0) {
>> - /* Short reads mean EOF, pad with zeros. */
>> + /*
>> + * For normal reads and writes, we only get here if ret == 0, which
>> + * means EOF for reads and ENOSPC for writes.
>> + * For zone-append, we get here with any ret >= 0, which we just
>> + * treat as ENOSPC, too (safer than resubmitting, probably, but not
>> + * 100 % clear).
>> + */
>> if (laiocb->type == QEMU_AIO_READ) {
>> - qemu_iovec_memset(laiocb->qiov, ret, 0,
>> - laiocb->qiov->size - ret);
>> + qemu_iovec_memset(laiocb->qiov, laiocb->total_done, 0,
>> + laiocb->qiov->size - laiocb->total_done);
>> } else {
>> ret = -ENOSPC;
>> }
>> @@ -102,6 +139,7 @@ static void qemu_laio_process_completion(struct qemu_laiocb *laiocb)
>> }
>>
>> laiocb->ret = ret;
>> + qemu_iovec_destroy(&laiocb->resubmit_qiov);
> Calling qemu_iovec_destroy() for a qiov that has potentially never been
> initialised feels a bit unsafe to me. It will work in practice for the
> current implementation, but maybe make this one conditional on
> laiocb->resubmit_qiov.iov, too? (Which is already making assumptions
> about the internals of QEMUIOVector, but that we'll have an iov after
> initialising the qiov with qemu_iovec_init() above will probably never
> change.)
Sure!
(For reference, the io-uring code calls qemu_iovec_destroy() the same
way, but I agree. I’ll make that change there, too.)
>> /*
>> * If the coroutine is already entered it must be in ioq_submit() and
>> @@ -380,23 +418,30 @@ static int laio_do_submit(struct qemu_laiocb *laiocb)
>> int fd = laiocb->fd;
>> off_t offset = laiocb->offset;
> I wonder if making it laiocb->offset + laiocb->total_done here wouldn't
> be more robust than having the addition in every call below.
I had it that way originally, but then decided to put it into the calls
below to show better which ones can actually be retried. Yes, the ones
not retried (zone_append) will always have total_done == 0, but I found
this clearer, personally.
Maybe I’ll make it two variables, original_offset and offset.
Hanna
>> + if (laiocb->resubmit_qiov.iov) {
>> + qiov = &laiocb->resubmit_qiov;
>> + }
>> +
>> switch (laiocb->type) {
>> case QEMU_AIO_WRITE:
>> #ifdef HAVE_IO_PREP_PWRITEV2
>> {
>> int laio_flags = (laiocb->flags & BDRV_REQ_FUA) ? RWF_DSYNC : 0;
>> - io_prep_pwritev2(iocbs, fd, qiov->iov, qiov->niov, offset, laio_flags);
>> + io_prep_pwritev2(iocbs, fd, qiov->iov, qiov->niov,
>> + offset + laiocb->total_done, laio_flags);
>> }
>> #else
>> assert(laiocb->flags == 0);
>> - io_prep_pwritev(iocbs, fd, qiov->iov, qiov->niov, offset);
>> + io_prep_pwritev(iocbs, fd, qiov->iov, qiov->niov,
>> + offset + laiocb->total_done);
>> #endif
>> break;
>> case QEMU_AIO_ZONE_APPEND:
>> io_prep_pwritev(iocbs, fd, qiov->iov, qiov->niov, offset);
>> break;
>> case QEMU_AIO_READ:
>> - io_prep_preadv(iocbs, fd, qiov->iov, qiov->niov, offset);
>> + io_prep_preadv(iocbs, fd, qiov->iov, qiov->niov,
>> + offset + laiocb->total_done);
>> break;
>> case QEMU_AIO_FLUSH:
>> io_prep_fdsync(iocbs, fd);
> Reviewed-by: Kevin Wolf <kwolf@redhat.com>
>
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH for-11.0 2/3] linux-aio: Resubmit tails of short reads/writes
2026-03-24 8:12 ` Hanna Czenczek
@ 2026-03-24 8:22 ` Hanna Czenczek
2026-03-24 9:22 ` Kevin Wolf
0 siblings, 1 reply; 16+ messages in thread
From: Hanna Czenczek @ 2026-03-24 8:22 UTC (permalink / raw)
To: Kevin Wolf
Cc: qemu-block, qemu-devel, Aarushi Mehta, Stefan Hajnoczi,
Stefano Garzarella
On 24.03.26 09:12, Hanna Czenczek wrote:
> On 23.03.26 18:12, Kevin Wolf wrote:
>> Am 18.03.2026 um 16:32 hat Hanna Czenczek geschrieben:
>>> Short reads/writes can happen. One way to reproduce them is via our
>>> FUSE export, with the following diff applied (%s/escaped // to apply --
>>> if you put plain diffs in commit messages, git-am will apply them,
>>> and I
>>> would rather avoid breaking FUSE accidentally via this patch):
>>>
>>> escaped diff --git a/block/export/fuse.c b/block/export/fuse.c
>>> escaped index a2a478d293..67dc50a412 100644
>>> escaped --- a/block/export/fuse.c
>>> escaped +++ b/block/export/fuse.c
>>> @@ -828,7 +828,7 @@ static ssize_t coroutine_fn GRAPH_RDLOCK
>>> fuse_co_init(FuseExport *exp, struct fuse_init_out *out,
>>> const struct fuse_init_in_compat *in)
>>> {
>>> - const uint32_t supported_flags = FUSE_ASYNC_READ | FUSE_ASYNC_DIO;
>>> + const uint32_t supported_flags = FUSE_ASYNC_READ;
>>>
>>> if (in->major != 7) {
>>> error_report("FUSE major version mismatch: We have 7, but
>>> kernel has %"
>>> @@ -1060,6 +1060,8 @@ fuse_co_read(FuseExport *exp, void **bufptr,
>>> uint64_t offset, uint32_t size)
>>> void *buf;
>>> int ret;
>>>
>>> + size = MIN(size, 4096);
>>> +
>>> /* Limited by max_read, should not happen */
>>> if (size > FUSE_MAX_READ_BYTES) {
>>> return -EINVAL;
>>> @@ -1110,6 +1112,8 @@ fuse_co_write(FuseExport *exp, struct
>>> fuse_write_out *out,
>>> int64_t blk_len;
>>> int ret;
>>>
>>> + size = MIN(size, 4096);
>>> +
>>> QEMU_BUILD_BUG_ON(FUSE_MAX_WRITE_BYTES > BDRV_REQUEST_MAX_BYTES);
>>> /* Limited by max_write, should not happen */
>>> if (size > FUSE_MAX_WRITE_BYTES) {
>>>
>>> Then:
>>> $ ./qemu-img create -f raw test.raw 8k
>>> Formatting 'test.raw', fmt=raw size=8192
>>> $ ./qemu-io -f raw -c 'write -P 42 0 8k' test.raw
>>> wrote 8192/8192 bytes at offset 0
>>> 8 KiB, 1 ops; 00.00 sec (64.804 MiB/sec and 8294.9003 ops/sec)
>>> $ hexdump -C test.raw
>>> 00000000 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a
>>> |****************|
>>> *
>>> 00002000
>>>
>>> With aio=threads, short I/O works:
>>> $ storage-daemon/qemu-storage-daemon \
>>> --blockdev file,node-name=test,filename=test.raw \
>>> --export
>>> fuse,id=exp,node-name=test,mountpoint=test.raw,writable=true
>>>
>>> Other shell:
>>> $ ./qemu-io --image-opts -c 'read -P 42 0 8k' \
>>> driver=file,filename=test.raw,cache.direct=on,aio=threads
>>> read 8192/8192 bytes at offset 0
>>> 8 KiB, 1 ops; 00.00 sec (36.563 MiB/sec and 4680.0923 ops/sec)
>>> $ ./qemu-io --image-opts -c 'write -P 23 0 8k' \
>>> driver=file,filename=test.raw,cache.direct=on,aio=threads
>>> wrote 8192/8192 bytes at offset 0
>>> 8 KiB, 1 ops; 00.00 sec (35.995 MiB/sec and 4607.2970 ops/sec)
>>> $ hexdump -C test.raw
>>> 00000000 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17
>>> |................|
>>> *
>>> 00002000
>>>
>>> But with aio=native, it does not:
>>> $ ./qemu-io --image-opts -c 'read -P 23 0 8k' \
>>> driver=file,filename=test.raw,cache.direct=on,aio=native
>>> Pattern verification failed at offset 0, 8192 bytes
>>> read 8192/8192 bytes at offset 0
>>> 8 KiB, 1 ops; 00.00 sec (86.155 MiB/sec and 11027.7900 ops/sec)
>>> $ ./qemu-io --image-opts -c 'write -P 42 0 8k' \
>>> driver=file,filename=test.raw,cache.direct=on,aio=native
>>> write failed: No space left on device
>>> $ hexdump -C test.raw
>>> 00000000 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a
>>> |****************|
>>> *
>>> 00001000 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17
>>> |................|
>>> *
>>> 00002000
>>>
>>> This patch fixes that.
>>>
>>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
>>> ---
>>> block/linux-aio.c | 61
>>> ++++++++++++++++++++++++++++++++++++++++-------
>>> 1 file changed, 53 insertions(+), 8 deletions(-)
>>>
>>> diff --git a/block/linux-aio.c b/block/linux-aio.c
>>> index 1f25339dc9..01621d4794 100644
>>> --- a/block/linux-aio.c
>>> +++ b/block/linux-aio.c
>>> @@ -46,6 +46,10 @@ struct qemu_laiocb {
>>> size_t nbytes;
>>> QEMUIOVector *qiov;
>>> + /* For handling short reads/writes */
>>> + size_t total_done;
>>> + QEMUIOVector resubmit_qiov;
>>> +
>>> int type;
>>> BdrvRequestFlags flags;
>>> uint64_t dev_max_batch;
>>> @@ -73,28 +77,61 @@ struct LinuxAioState {
>>> };
>>> static void ioq_submit(LinuxAioState *s);
>>> +static int laio_do_submit(struct qemu_laiocb *laiocb);
>>> static inline ssize_t io_event_ret(struct io_event *ev)
>>> {
>>> return (ssize_t)(((uint64_t)ev->res2 << 32) | ev->res);
>>> }
>>> +/**
>>> + * Retry tail of short requests.
>>> + */
>>> +static int laio_resubmit_short_io(struct qemu_laiocb *laiocb,
>>> size_t done)
>>> +{
>>> + QEMUIOVector *resubmit_qiov = &laiocb->resubmit_qiov;
>>> +
>>> + laiocb->total_done += done;
>>> +
>>> + if (!resubmit_qiov->iov) {
>>> + qemu_iovec_init(resubmit_qiov, laiocb->qiov->niov);
>>> + } else {
>>> + qemu_iovec_reset(resubmit_qiov);
>>> + }
>>> + qemu_iovec_concat(resubmit_qiov, laiocb->qiov,
>>> + laiocb->total_done, laiocb->nbytes -
>>> laiocb->total_done);
>>> +
>>> + return laio_do_submit(laiocb);
>>> +}
>>> +
>>> /*
>>> * Completes an AIO request.
>>> */
>>> static void qemu_laio_process_completion(struct qemu_laiocb *laiocb)
>>> {
>>> - int ret;
>>> + ssize_t ret;
>>> ret = laiocb->ret;
>>> if (ret != -ECANCELED) {
>>> - if (ret == laiocb->nbytes) {
>>> + if (ret == laiocb->nbytes - laiocb->total_done) {
>>> ret = 0;
>>> + } else if (ret > 0 && (laiocb->type == QEMU_AIO_READ ||
>>> + laiocb->type == QEMU_AIO_WRITE)) {
>>> + ret = laio_resubmit_short_io(laiocb, ret);
>>> + if (!ret) {
>>> + return;
>>> + }
>>> } else if (ret >= 0) {
>>> - /* Short reads mean EOF, pad with zeros. */
>>> + /*
>>> + * For normal reads and writes, we only get here if ret
>>> == 0, which
>>> + * means EOF for reads and ENOSPC for writes.
>>> + * For zone-append, we get here with any ret >= 0,
>>> which we just
>>> + * treat as ENOSPC, too (safer than resubmitting,
>>> probably, but not
>>> + * 100 % clear).
>>> + */
>>> if (laiocb->type == QEMU_AIO_READ) {
>>> - qemu_iovec_memset(laiocb->qiov, ret, 0,
>>> - laiocb->qiov->size - ret);
>>> + qemu_iovec_memset(laiocb->qiov, laiocb->total_done, 0,
>>> + laiocb->qiov->size -
>>> laiocb->total_done);
>>> } else {
>>> ret = -ENOSPC;
>>> }
>>> @@ -102,6 +139,7 @@ static void qemu_laio_process_completion(struct
>>> qemu_laiocb *laiocb)
>>> }
>>> laiocb->ret = ret;
>>> + qemu_iovec_destroy(&laiocb->resubmit_qiov);
>> Calling qemu_iovec_destroy() for a qiov that has potentially never been
>> initialised feels a bit unsafe to me. It will work in practice for the
>> current implementation, but maybe make this one conditional on
>> laiocb->resubmit_qiov.iov, too? (Which is already making assumptions
>> about the internals of QEMUIOVector, but that we'll have an iov after
>> initialising the qiov with qemu_iovec_init() above will probably never
>> change.)
>
> Sure!
>
> (For reference, the io-uring code calls qemu_iovec_destroy() the same
> way, but I agree. I’ll make that change there, too.)
>
>>> /*
>>> * If the coroutine is already entered it must be in
>>> ioq_submit() and
>>> @@ -380,23 +418,30 @@ static int laio_do_submit(struct qemu_laiocb
>>> *laiocb)
>>> int fd = laiocb->fd;
>>> off_t offset = laiocb->offset;
>> I wonder if making it laiocb->offset + laiocb->total_done here wouldn't
>> be more robust than having the addition in every call below.
>
> I had it that way originally, but then decided to put it into the
> calls below to show better which ones can actually be retried. Yes,
> the ones not retried (zone_append) will always have total_done == 0,
> but I found this clearer, personally.
>
> Maybe I’ll make it two variables, original_offset and offset.
>
> Hanna
>
>>> + if (laiocb->resubmit_qiov.iov) {
>>> + qiov = &laiocb->resubmit_qiov;
>>> + }
>>> +
Now that I actually touch this, this here makes me wonder what my point
of “yes, sure, total_done will be 0 on no resubmission anyway, but!”
even was. If this piece of code unconditionally uses the resubmit_qiov
(when set up) for all requests, including zone_append, we might as well
include total_done in the offset, for all requests.
Hanna
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH for-11.0 2/3] linux-aio: Resubmit tails of short reads/writes
2026-03-24 8:22 ` Hanna Czenczek
@ 2026-03-24 9:22 ` Kevin Wolf
2026-03-24 10:04 ` Hanna Czenczek
0 siblings, 1 reply; 16+ messages in thread
From: Kevin Wolf @ 2026-03-24 9:22 UTC (permalink / raw)
To: Hanna Czenczek
Cc: qemu-block, qemu-devel, Aarushi Mehta, Stefan Hajnoczi,
Stefano Garzarella
Am 24.03.2026 um 09:22 hat Hanna Czenczek geschrieben:
> On 24.03.26 09:12, Hanna Czenczek wrote:
> > On 23.03.26 18:12, Kevin Wolf wrote:
> > > Am 18.03.2026 um 16:32 hat Hanna Czenczek geschrieben:
> > > > Short reads/writes can happen. One way to reproduce them is via our
> > > > FUSE export, with the following diff applied (%s/escaped // to apply --
> > > > if you put plain diffs in commit messages, git-am will apply
> > > > them, and I
> > > > would rather avoid breaking FUSE accidentally via this patch):
> > > >
> > > > escaped diff --git a/block/export/fuse.c b/block/export/fuse.c
> > > > escaped index a2a478d293..67dc50a412 100644
> > > > escaped --- a/block/export/fuse.c
> > > > escaped +++ b/block/export/fuse.c
> > > > @@ -828,7 +828,7 @@ static ssize_t coroutine_fn GRAPH_RDLOCK
> > > > fuse_co_init(FuseExport *exp, struct fuse_init_out *out,
> > > > const struct fuse_init_in_compat *in)
> > > > {
> > > > - const uint32_t supported_flags = FUSE_ASYNC_READ | FUSE_ASYNC_DIO;
> > > > + const uint32_t supported_flags = FUSE_ASYNC_READ;
> > > >
> > > > if (in->major != 7) {
> > > > error_report("FUSE major version mismatch: We have 7,
> > > > but kernel has %"
> > > > @@ -1060,6 +1060,8 @@ fuse_co_read(FuseExport *exp, void
> > > > **bufptr, uint64_t offset, uint32_t size)
> > > > void *buf;
> > > > int ret;
> > > >
> > > > + size = MIN(size, 4096);
> > > > +
> > > > /* Limited by max_read, should not happen */
> > > > if (size > FUSE_MAX_READ_BYTES) {
> > > > return -EINVAL;
> > > > @@ -1110,6 +1112,8 @@ fuse_co_write(FuseExport *exp, struct
> > > > fuse_write_out *out,
> > > > int64_t blk_len;
> > > > int ret;
> > > >
> > > > + size = MIN(size, 4096);
> > > > +
> > > > QEMU_BUILD_BUG_ON(FUSE_MAX_WRITE_BYTES > BDRV_REQUEST_MAX_BYTES);
> > > > /* Limited by max_write, should not happen */
> > > > if (size > FUSE_MAX_WRITE_BYTES) {
> > > >
> > > > Then:
> > > > $ ./qemu-img create -f raw test.raw 8k
> > > > Formatting 'test.raw', fmt=raw size=8192
> > > > $ ./qemu-io -f raw -c 'write -P 42 0 8k' test.raw
> > > > wrote 8192/8192 bytes at offset 0
> > > > 8 KiB, 1 ops; 00.00 sec (64.804 MiB/sec and 8294.9003 ops/sec)
> > > > $ hexdump -C test.raw
> > > > 00000000 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a
> > > > |****************|
> > > > *
> > > > 00002000
> > > >
> > > > With aio=threads, short I/O works:
> > > > $ storage-daemon/qemu-storage-daemon \
> > > > --blockdev file,node-name=test,filename=test.raw \
> > > > --export
> > > > fuse,id=exp,node-name=test,mountpoint=test.raw,writable=true
> > > >
> > > > Other shell:
> > > > $ ./qemu-io --image-opts -c 'read -P 42 0 8k' \
> > > > driver=file,filename=test.raw,cache.direct=on,aio=threads
> > > > read 8192/8192 bytes at offset 0
> > > > 8 KiB, 1 ops; 00.00 sec (36.563 MiB/sec and 4680.0923 ops/sec)
> > > > $ ./qemu-io --image-opts -c 'write -P 23 0 8k' \
> > > > driver=file,filename=test.raw,cache.direct=on,aio=threads
> > > > wrote 8192/8192 bytes at offset 0
> > > > 8 KiB, 1 ops; 00.00 sec (35.995 MiB/sec and 4607.2970 ops/sec)
> > > > $ hexdump -C test.raw
> > > > 00000000 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17
> > > > |................|
> > > > *
> > > > 00002000
> > > >
> > > > But with aio=native, it does not:
> > > > $ ./qemu-io --image-opts -c 'read -P 23 0 8k' \
> > > > driver=file,filename=test.raw,cache.direct=on,aio=native
> > > > Pattern verification failed at offset 0, 8192 bytes
> > > > read 8192/8192 bytes at offset 0
> > > > 8 KiB, 1 ops; 00.00 sec (86.155 MiB/sec and 11027.7900 ops/sec)
> > > > $ ./qemu-io --image-opts -c 'write -P 42 0 8k' \
> > > > driver=file,filename=test.raw,cache.direct=on,aio=native
> > > > write failed: No space left on device
> > > > $ hexdump -C test.raw
> > > > 00000000 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a
> > > > |****************|
> > > > *
> > > > 00001000 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17
> > > > |................|
> > > > *
> > > > 00002000
> > > >
> > > > This patch fixes that.
> > > >
> > > > Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> > > > ---
> > > > block/linux-aio.c | 61
> > > > ++++++++++++++++++++++++++++++++++++++++-------
> > > > 1 file changed, 53 insertions(+), 8 deletions(-)
> > > >
> > > > diff --git a/block/linux-aio.c b/block/linux-aio.c
> > > > index 1f25339dc9..01621d4794 100644
> > > > --- a/block/linux-aio.c
> > > > +++ b/block/linux-aio.c
> > > > @@ -46,6 +46,10 @@ struct qemu_laiocb {
> > > > size_t nbytes;
> > > > QEMUIOVector *qiov;
> > > > + /* For handling short reads/writes */
> > > > + size_t total_done;
> > > > + QEMUIOVector resubmit_qiov;
> > > > +
> > > > int type;
> > > > BdrvRequestFlags flags;
> > > > uint64_t dev_max_batch;
> > > > @@ -73,28 +77,61 @@ struct LinuxAioState {
> > > > };
> > > > static void ioq_submit(LinuxAioState *s);
> > > > +static int laio_do_submit(struct qemu_laiocb *laiocb);
> > > > static inline ssize_t io_event_ret(struct io_event *ev)
> > > > {
> > > > return (ssize_t)(((uint64_t)ev->res2 << 32) | ev->res);
> > > > }
> > > > +/**
> > > > + * Retry tail of short requests.
> > > > + */
> > > > +static int laio_resubmit_short_io(struct qemu_laiocb *laiocb,
> > > > size_t done)
> > > > +{
> > > > + QEMUIOVector *resubmit_qiov = &laiocb->resubmit_qiov;
> > > > +
> > > > + laiocb->total_done += done;
> > > > +
> > > > + if (!resubmit_qiov->iov) {
> > > > + qemu_iovec_init(resubmit_qiov, laiocb->qiov->niov);
> > > > + } else {
> > > > + qemu_iovec_reset(resubmit_qiov);
> > > > + }
> > > > + qemu_iovec_concat(resubmit_qiov, laiocb->qiov,
> > > > + laiocb->total_done, laiocb->nbytes -
> > > > laiocb->total_done);
> > > > +
> > > > + return laio_do_submit(laiocb);
> > > > +}
> > > > +
> > > > /*
> > > > * Completes an AIO request.
> > > > */
> > > > static void qemu_laio_process_completion(struct qemu_laiocb *laiocb)
> > > > {
> > > > - int ret;
> > > > + ssize_t ret;
> > > > ret = laiocb->ret;
> > > > if (ret != -ECANCELED) {
> > > > - if (ret == laiocb->nbytes) {
> > > > + if (ret == laiocb->nbytes - laiocb->total_done) {
> > > > ret = 0;
> > > > + } else if (ret > 0 && (laiocb->type == QEMU_AIO_READ ||
> > > > + laiocb->type == QEMU_AIO_WRITE)) {
> > > > + ret = laio_resubmit_short_io(laiocb, ret);
> > > > + if (!ret) {
> > > > + return;
> > > > + }
> > > > } else if (ret >= 0) {
> > > > - /* Short reads mean EOF, pad with zeros. */
> > > > + /*
> > > > + * For normal reads and writes, we only get here if
> > > > ret == 0, which
> > > > + * means EOF for reads and ENOSPC for writes.
> > > > + * For zone-append, we get here with any ret >= 0,
> > > > which we just
> > > > + * treat as ENOSPC, too (safer than resubmitting,
> > > > probably, but not
> > > > + * 100 % clear).
> > > > + */
> > > > if (laiocb->type == QEMU_AIO_READ) {
> > > > - qemu_iovec_memset(laiocb->qiov, ret, 0,
> > > > - laiocb->qiov->size - ret);
> > > > + qemu_iovec_memset(laiocb->qiov, laiocb->total_done, 0,
> > > > + laiocb->qiov->size -
> > > > laiocb->total_done);
> > > > } else {
> > > > ret = -ENOSPC;
> > > > }
> > > > @@ -102,6 +139,7 @@ static void
> > > > qemu_laio_process_completion(struct qemu_laiocb *laiocb)
> > > > }
> > > > laiocb->ret = ret;
> > > > + qemu_iovec_destroy(&laiocb->resubmit_qiov);
> > > Calling qemu_iovec_destroy() for a qiov that has potentially never been
> > > initialised feels a bit unsafe to me. It will work in practice for the
> > > current implementation, but maybe make this one conditional on
> > > laiocb->resubmit_qiov.iov, too? (Which is already making assumptions
> > > about the internals of QEMUIOVector, but that we'll have an iov after
> > > initialising the qiov with qemu_iovec_init() above will probably never
> > > change.)
> >
> > Sure!
> >
> > (For reference, the io-uring code calls qemu_iovec_destroy() the same
> > way, but I agree. I’ll make that change there, too.)
> >
> > > > /*
> > > > * If the coroutine is already entered it must be in
> > > > ioq_submit() and
> > > > @@ -380,23 +418,30 @@ static int laio_do_submit(struct
> > > > qemu_laiocb *laiocb)
> > > > int fd = laiocb->fd;
> > > > off_t offset = laiocb->offset;
> > > I wonder if making it laiocb->offset + laiocb->total_done here wouldn't
> > > be more robust than having the addition in every call below.
> >
> > I had it that way originally, but then decided to put it into the calls
> > below to show better which ones can actually be retried. Yes, the ones
> > not retried (zone_append) will always have total_done == 0, but I found
> > this clearer, personally.
> >
> > Maybe I’ll make it two variables, original_offset and offset.
> >
> > Hanna
> >
> > > > + if (laiocb->resubmit_qiov.iov) {
> > > > + qiov = &laiocb->resubmit_qiov;
> > > > + }
> > > > +
>
> Now that I actually touch this, this here makes me wonder what my point of
> “yes, sure, total_done will be 0 on no resubmission anyway, but!” even was.
> If this piece of code unconditionally uses the resubmit_qiov (when set up)
> for all requests, including zone_append, we might as well include total_done
> in the offset, for all requests.
I actually considered if I should suggest 'offset += laiocb->total_done'
in this if block instead of doing it right at initialisation, but then
thought what's the point, it's the same in practice anyway.
If you want to make the zone_append case clearer, we could add an
assertion that it's 0 there, if that feels better to you.
Kevin
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH for-11.0 2/3] linux-aio: Resubmit tails of short reads/writes
2026-03-24 9:22 ` Kevin Wolf
@ 2026-03-24 10:04 ` Hanna Czenczek
0 siblings, 0 replies; 16+ messages in thread
From: Hanna Czenczek @ 2026-03-24 10:04 UTC (permalink / raw)
To: Kevin Wolf
Cc: qemu-block, qemu-devel, Aarushi Mehta, Stefan Hajnoczi,
Stefano Garzarella
On 24.03.26 10:22, Kevin Wolf wrote:
> Am 24.03.2026 um 09:22 hat Hanna Czenczek geschrieben:
>> On 24.03.26 09:12, Hanna Czenczek wrote:
>>> On 23.03.26 18:12, Kevin Wolf wrote:
>>>> Am 18.03.2026 um 16:32 hat Hanna Czenczek geschrieben:
>>>>> Short reads/writes can happen. One way to reproduce them is via our
>>>>> FUSE export, with the following diff applied (%s/escaped // to apply --
>>>>> if you put plain diffs in commit messages, git-am will apply
>>>>> them, and I
>>>>> would rather avoid breaking FUSE accidentally via this patch):
>>>>>
>>>>> escaped diff --git a/block/export/fuse.c b/block/export/fuse.c
>>>>> escaped index a2a478d293..67dc50a412 100644
>>>>> escaped --- a/block/export/fuse.c
>>>>> escaped +++ b/block/export/fuse.c
>>>>> @@ -828,7 +828,7 @@ static ssize_t coroutine_fn GRAPH_RDLOCK
>>>>> fuse_co_init(FuseExport *exp, struct fuse_init_out *out,
>>>>> const struct fuse_init_in_compat *in)
>>>>> {
>>>>> - const uint32_t supported_flags = FUSE_ASYNC_READ | FUSE_ASYNC_DIO;
>>>>> + const uint32_t supported_flags = FUSE_ASYNC_READ;
>>>>>
>>>>> if (in->major != 7) {
>>>>> error_report("FUSE major version mismatch: We have 7,
>>>>> but kernel has %"
>>>>> @@ -1060,6 +1060,8 @@ fuse_co_read(FuseExport *exp, void
>>>>> **bufptr, uint64_t offset, uint32_t size)
>>>>> void *buf;
>>>>> int ret;
>>>>>
>>>>> + size = MIN(size, 4096);
>>>>> +
>>>>> /* Limited by max_read, should not happen */
>>>>> if (size > FUSE_MAX_READ_BYTES) {
>>>>> return -EINVAL;
>>>>> @@ -1110,6 +1112,8 @@ fuse_co_write(FuseExport *exp, struct
>>>>> fuse_write_out *out,
>>>>> int64_t blk_len;
>>>>> int ret;
>>>>>
>>>>> + size = MIN(size, 4096);
>>>>> +
>>>>> QEMU_BUILD_BUG_ON(FUSE_MAX_WRITE_BYTES > BDRV_REQUEST_MAX_BYTES);
>>>>> /* Limited by max_write, should not happen */
>>>>> if (size > FUSE_MAX_WRITE_BYTES) {
>>>>>
>>>>> Then:
>>>>> $ ./qemu-img create -f raw test.raw 8k
>>>>> Formatting 'test.raw', fmt=raw size=8192
>>>>> $ ./qemu-io -f raw -c 'write -P 42 0 8k' test.raw
>>>>> wrote 8192/8192 bytes at offset 0
>>>>> 8 KiB, 1 ops; 00.00 sec (64.804 MiB/sec and 8294.9003 ops/sec)
>>>>> $ hexdump -C test.raw
>>>>> 00000000 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a
>>>>> |****************|
>>>>> *
>>>>> 00002000
>>>>>
>>>>> With aio=threads, short I/O works:
>>>>> $ storage-daemon/qemu-storage-daemon \
>>>>> --blockdev file,node-name=test,filename=test.raw \
>>>>> --export
>>>>> fuse,id=exp,node-name=test,mountpoint=test.raw,writable=true
>>>>>
>>>>> Other shell:
>>>>> $ ./qemu-io --image-opts -c 'read -P 42 0 8k' \
>>>>> driver=file,filename=test.raw,cache.direct=on,aio=threads
>>>>> read 8192/8192 bytes at offset 0
>>>>> 8 KiB, 1 ops; 00.00 sec (36.563 MiB/sec and 4680.0923 ops/sec)
>>>>> $ ./qemu-io --image-opts -c 'write -P 23 0 8k' \
>>>>> driver=file,filename=test.raw,cache.direct=on,aio=threads
>>>>> wrote 8192/8192 bytes at offset 0
>>>>> 8 KiB, 1 ops; 00.00 sec (35.995 MiB/sec and 4607.2970 ops/sec)
>>>>> $ hexdump -C test.raw
>>>>> 00000000 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17
>>>>> |................|
>>>>> *
>>>>> 00002000
>>>>>
>>>>> But with aio=native, it does not:
>>>>> $ ./qemu-io --image-opts -c 'read -P 23 0 8k' \
>>>>> driver=file,filename=test.raw,cache.direct=on,aio=native
>>>>> Pattern verification failed at offset 0, 8192 bytes
>>>>> read 8192/8192 bytes at offset 0
>>>>> 8 KiB, 1 ops; 00.00 sec (86.155 MiB/sec and 11027.7900 ops/sec)
>>>>> $ ./qemu-io --image-opts -c 'write -P 42 0 8k' \
>>>>> driver=file,filename=test.raw,cache.direct=on,aio=native
>>>>> write failed: No space left on device
>>>>> $ hexdump -C test.raw
>>>>> 00000000 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a
>>>>> |****************|
>>>>> *
>>>>> 00001000 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17
>>>>> |................|
>>>>> *
>>>>> 00002000
>>>>>
>>>>> This patch fixes that.
>>>>>
>>>>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
>>>>> ---
>>>>> block/linux-aio.c | 61
>>>>> ++++++++++++++++++++++++++++++++++++++++-------
>>>>> 1 file changed, 53 insertions(+), 8 deletions(-)
>>>>>
>>>>> diff --git a/block/linux-aio.c b/block/linux-aio.c
>>>>> index 1f25339dc9..01621d4794 100644
>>>>> --- a/block/linux-aio.c
>>>>> +++ b/block/linux-aio.c
>>>>> @@ -46,6 +46,10 @@ struct qemu_laiocb {
>>>>> size_t nbytes;
>>>>> QEMUIOVector *qiov;
>>>>> + /* For handling short reads/writes */
>>>>> + size_t total_done;
>>>>> + QEMUIOVector resubmit_qiov;
>>>>> +
>>>>> int type;
>>>>> BdrvRequestFlags flags;
>>>>> uint64_t dev_max_batch;
>>>>> @@ -73,28 +77,61 @@ struct LinuxAioState {
>>>>> };
>>>>> static void ioq_submit(LinuxAioState *s);
>>>>> +static int laio_do_submit(struct qemu_laiocb *laiocb);
>>>>> static inline ssize_t io_event_ret(struct io_event *ev)
>>>>> {
>>>>> return (ssize_t)(((uint64_t)ev->res2 << 32) | ev->res);
>>>>> }
>>>>> +/**
>>>>> + * Retry tail of short requests.
>>>>> + */
>>>>> +static int laio_resubmit_short_io(struct qemu_laiocb *laiocb,
>>>>> size_t done)
>>>>> +{
>>>>> + QEMUIOVector *resubmit_qiov = &laiocb->resubmit_qiov;
>>>>> +
>>>>> + laiocb->total_done += done;
>>>>> +
>>>>> + if (!resubmit_qiov->iov) {
>>>>> + qemu_iovec_init(resubmit_qiov, laiocb->qiov->niov);
>>>>> + } else {
>>>>> + qemu_iovec_reset(resubmit_qiov);
>>>>> + }
>>>>> + qemu_iovec_concat(resubmit_qiov, laiocb->qiov,
>>>>> + laiocb->total_done, laiocb->nbytes -
>>>>> laiocb->total_done);
>>>>> +
>>>>> + return laio_do_submit(laiocb);
>>>>> +}
>>>>> +
>>>>> /*
>>>>> * Completes an AIO request.
>>>>> */
>>>>> static void qemu_laio_process_completion(struct qemu_laiocb *laiocb)
>>>>> {
>>>>> - int ret;
>>>>> + ssize_t ret;
>>>>> ret = laiocb->ret;
>>>>> if (ret != -ECANCELED) {
>>>>> - if (ret == laiocb->nbytes) {
>>>>> + if (ret == laiocb->nbytes - laiocb->total_done) {
>>>>> ret = 0;
>>>>> + } else if (ret > 0 && (laiocb->type == QEMU_AIO_READ ||
>>>>> + laiocb->type == QEMU_AIO_WRITE)) {
>>>>> + ret = laio_resubmit_short_io(laiocb, ret);
>>>>> + if (!ret) {
>>>>> + return;
>>>>> + }
>>>>> } else if (ret >= 0) {
>>>>> - /* Short reads mean EOF, pad with zeros. */
>>>>> + /*
>>>>> + * For normal reads and writes, we only get here if
>>>>> ret == 0, which
>>>>> + * means EOF for reads and ENOSPC for writes.
>>>>> + * For zone-append, we get here with any ret >= 0,
>>>>> which we just
>>>>> + * treat as ENOSPC, too (safer than resubmitting,
>>>>> probably, but not
>>>>> + * 100 % clear).
>>>>> + */
>>>>> if (laiocb->type == QEMU_AIO_READ) {
>>>>> - qemu_iovec_memset(laiocb->qiov, ret, 0,
>>>>> - laiocb->qiov->size - ret);
>>>>> + qemu_iovec_memset(laiocb->qiov, laiocb->total_done, 0,
>>>>> + laiocb->qiov->size -
>>>>> laiocb->total_done);
>>>>> } else {
>>>>> ret = -ENOSPC;
>>>>> }
>>>>> @@ -102,6 +139,7 @@ static void
>>>>> qemu_laio_process_completion(struct qemu_laiocb *laiocb)
>>>>> }
>>>>> laiocb->ret = ret;
>>>>> + qemu_iovec_destroy(&laiocb->resubmit_qiov);
>>>> Calling qemu_iovec_destroy() for a qiov that has potentially never been
>>>> initialised feels a bit unsafe to me. It will work in practice for the
>>>> current implementation, but maybe make this one conditional on
>>>> laiocb->resubmit_qiov.iov, too? (Which is already making assumptions
>>>> about the internals of QEMUIOVector, but that we'll have an iov after
>>>> initialising the qiov with qemu_iovec_init() above will probably never
>>>> change.)
>>> Sure!
>>>
>>> (For reference, the io-uring code calls qemu_iovec_destroy() the same
>>> way, but I agree. I’ll make that change there, too.)
>>>
>>>>> /*
>>>>> * If the coroutine is already entered it must be in
>>>>> ioq_submit() and
>>>>> @@ -380,23 +418,30 @@ static int laio_do_submit(struct
>>>>> qemu_laiocb *laiocb)
>>>>> int fd = laiocb->fd;
>>>>> off_t offset = laiocb->offset;
>>>> I wonder if making it laiocb->offset + laiocb->total_done here wouldn't
>>>> be more robust than having the addition in every call below.
>>> I had it that way originally, but then decided to put it into the calls
>>> below to show better which ones can actually be retried. Yes, the ones
>>> not retried (zone_append) will always have total_done == 0, but I found
>>> this clearer, personally.
>>>
>>> Maybe I’ll make it two variables, original_offset and offset.
>>>
>>> Hanna
>>>
>>>>> + if (laiocb->resubmit_qiov.iov) {
>>>>> + qiov = &laiocb->resubmit_qiov;
>>>>> + }
>>>>> +
>> Now that I actually touch this, this here makes me wonder what my point of
>> “yes, sure, total_done will be 0 on no resubmission anyway, but!” even was.
>> If this piece of code unconditionally uses the resubmit_qiov (when set up)
>> for all requests, including zone_append, we might as well include total_done
>> in the offset, for all requests.
> I actually considered if I should suggest 'offset += laiocb->total_done'
> in this if block instead of doing it right at initialisation, but then
> thought what's the point, it's the same in practice anyway.
>
> If you want to make the zone_append case clearer, we could add an
> assertion that it's 0 there, if that feels better to you.
No, I think it’s ok as-is (with offset including total_done). I’ve sent
a v2.
Hanna
^ permalink raw reply [flat|nested] 16+ messages in thread
end of thread, other threads:[~2026-03-24 10:04 UTC | newest]
Thread overview: 16+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-03-18 15:32 [PATCH for-11.0 0/3] linux-aio/io-uring: Resubmit tails of short requests Hanna Czenczek
2026-03-18 15:32 ` [PATCH for-11.0 1/3] linux-aio: Put all parameters into qemu_laiocb Hanna Czenczek
2026-03-23 16:36 ` Kevin Wolf
2026-03-23 17:02 ` Hanna Czenczek
2026-03-23 17:04 ` Hanna Czenczek
2026-03-23 19:10 ` Kevin Wolf
2026-03-18 15:32 ` [PATCH for-11.0 2/3] linux-aio: Resubmit tails of short reads/writes Hanna Czenczek
2026-03-23 17:12 ` Kevin Wolf
2026-03-24 8:12 ` Hanna Czenczek
2026-03-24 8:22 ` Hanna Czenczek
2026-03-24 9:22 ` Kevin Wolf
2026-03-24 10:04 ` Hanna Czenczek
2026-03-18 15:32 ` [PATCH for-11.0 3/3] io-uring: Resubmit tails of short writes Hanna Czenczek
2026-03-23 19:05 ` Kevin Wolf
2026-03-23 16:28 ` [PATCH for-11.0 0/3] linux-aio/io-uring: Resubmit tails of short requests Kevin Wolf
2026-03-23 16:59 ` Hanna Czenczek
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox